1 00:00:09,679 --> 00:00:13,891 - Hello? Okay, it's after 12, so I want to get started. 2 00:00:13,891 --> 00:00:17,822 So today, lecture eight, we're going to talk about deep learning software. 3 00:00:17,822 --> 00:00:21,283 This is a super exciting topic because it changes a lot every year. 4 00:00:21,283 --> 00:00:25,621 But it also means it's a lot of work to give this lecture 'cause it changes a lot every year. 5 00:00:25,621 --> 00:00:30,024 But as usual, a couple administrative notes before we dive into the material. 6 00:00:30,024 --> 00:00:34,563 So as a reminder, the project proposals for your course projects were due on Tuesday. 7 00:00:34,563 --> 00:00:42,766 So hopefully you all turned that in, and hopefully you all have a somewhat good idea of what kind of projects you want to work on for the class. 8 00:00:42,766 --> 00:00:50,217 So we're in the process of assigning TAs to projects based on what the project area is and the expertise of the TAs. 9 00:00:50,217 --> 00:00:54,264 So we'll have some more information about that in the next couple days I think. 10 00:00:54,264 --> 00:00:56,563 We're also in the process of grading assignment one, 11 00:00:56,563 --> 00:01:00,942 so stay tuned and we'll get those grades back to you as soon as we can. 12 00:01:00,942 --> 00:01:08,680 Another reminder is that assignment two has been out for a while. That's going to be due next week, a week from today, Thursday. 13 00:01:08,680 --> 00:01:16,231 And again, when working on assignment two, remember to stop your Google Cloud instances when you're not working to try to preserve your credits. 14 00:01:16,231 --> 00:01:24,812 And another point of confusion I just wanted to re-emphasize is that for assignment two you really only need to use GPU instances for the last notebook. 15 00:01:24,812 --> 00:01:32,250 For all of the other notebooks it's just Python and Numpy, so you don't need any GPUs for those questions. 16 00:01:32,250 --> 00:01:36,701 So again, conserve your credits, only use GPUs when you need them. 17 00:01:36,701 --> 00:01:39,973 And the final reminder is that the midterm is coming up. 18 00:01:39,973 --> 00:01:45,683 It's kind of hard to believe we're there already, but the midterm will be in class on Tuesday, May 9th. 19 00:01:45,683 --> 00:01:47,901 So the midterm will be more theoretical. 20 00:01:47,901 --> 00:01:57,071 It'll be sort of pen and paper, working through different kinds of slightly more theoretical questions to check your understanding of the material that we've covered so far. 21 00:01:57,071 --> 00:02:02,506 And I think we'll probably post at least a short sort of sample of the types of questions to expect. 22 00:02:02,506 --> 00:02:03,695 Question? 23 00:02:03,695 --> 00:02:05,310 [student's words obscured due to lack of microphone] 24 00:02:05,310 --> 00:02:10,675 Oh yeah, the question is whether it's open-book, so we're going to say closed note, closed book. 25 00:02:10,675 --> 00:02:15,671 Yeah, so that's what we've done in the past, just closed note, closed book; we 26 00:02:15,671 --> 00:02:21,735 just want to check that you understand the intuition behind most of the stuff we've presented. 27 00:02:23,618 --> 00:02:27,577 So, a quick recap as a reminder of what we were talking about last time. 28 00:02:27,577 --> 00:02:29,737 Last time we talked about fancier optimization algorithms 29 00:02:29,737 --> 00:02:34,975 for deep learning models including SGD Momentum, Nesterov, RMSProp and Adam.
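(As a refresher, a minimal sketch of one of those tweaks in Numpy; this is an illustrative SGD-with-momentum update, not code from the slides, and the variable names are made up.)

```python
import numpy as np

# Sketch of SGD with momentum (illustrative, not from the slides).
# w: weights, dw: gradient of the loss w.r.t. w, v: velocity
def sgd_momentum_step(w, dw, v, learning_rate=1e-3, rho=0.9):
    v = rho * v + dw              # accumulate a running "velocity" of gradients
    w = w - learning_rate * v     # step in the direction of the velocity
    return w, v

w = np.random.randn(10)
v = np.zeros_like(w)
dw = np.random.randn(10)          # stand-in for a real gradient
w, v = sgd_momentum_step(w, dw, v)
```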
30 00:02:34,975 --> 00:02:45,492 And we saw that these relatively small tweaks on top of vanilla SGD are relatively easy to implement but can make your networks converge a bit faster. 31 00:02:45,492 --> 00:02:48,529 We also talked about regularization, especially dropout. 32 00:02:48,529 --> 00:02:56,975 So remember dropout, you're kind of randomly setting parts of the network to zero during the forward pass, and then you kind of marginalize out over that noise at test time. 33 00:02:56,975 --> 00:03:02,805 And we saw that this was kind of a general pattern across many different types of regularization in deep learning, where you might add some kind 34 00:03:02,805 --> 00:03:08,415 of noise during training, but then marginalize out that noise at test time so it's not stochastic at test time. 35 00:03:08,415 --> 00:03:15,376 We also talked about transfer learning where you can maybe download big networks that were pre-trained on some dataset and then fine-tune them for your own problem. 36 00:03:15,376 --> 00:03:21,314 And this is one way that you can attack a lot of problems in deep learning, even if you don't have a huge dataset of your own. 37 00:03:22,781 --> 00:03:29,615 So today we're going to shift gears a little bit and talk about some of the nuts and bolts of writing software and how the hardware works. 38 00:03:29,615 --> 00:03:36,276 And a little bit, diving into a lot of details about what the software looks like that you actually use to train these things in practice. 39 00:03:36,276 --> 00:03:43,967 So we'll talk a little bit about CPUs and GPUs and then we'll talk about several of the major deep learning frameworks that are out there in use these days. 40 00:03:45,471 --> 00:03:52,961 So first, we've sort of mentioned this offhand a bunch of different times, that computers have CPUs, computers have GPUs. 41 00:03:52,961 --> 00:04:02,655 Deep learning uses GPUs, but we weren't really too explicit up to this point about what exactly these things are and why one might be better than another for different tasks. 42 00:04:02,655 --> 00:04:06,472 So, who's built a computer before? Just kind of show of hands. 43 00:04:06,472 --> 00:04:10,965 So, maybe about a third of you, half of you, somewhere around that ballpark. 44 00:04:10,965 --> 00:04:15,174 So this is a shot of my computer at home that I built. 45 00:04:15,174 --> 00:04:22,261 And you can see that there's a lot of stuff going on inside the computer, maybe, hopefully you know what most of these parts are. 46 00:04:22,261 --> 00:04:25,594 And the CPU is the Central Processing Unit. 47 00:04:25,594 --> 00:04:31,391 That's this little chip hidden under this cooling fan right here near the top of the case. 48 00:04:31,391 --> 00:04:39,555 And the CPU is actually a relatively small piece. It's a relatively small thing inside the case. It's not taking up a lot of space. 49 00:04:39,555 --> 00:04:46,221 And the GPUs are these two big monster things that are taking up a gigantic amount of space in the case. 50 00:04:46,221 --> 00:04:50,296 They have their own cooling, they're taking a lot of power. They're quite large. 51 00:04:50,296 --> 00:04:59,139 So, just in terms of how much power they're using, in terms of how big they are, the GPUs are kind of physically imposing and taking up a lot of space in the case. 52 00:04:59,139 --> 00:05:04,516 So the question is what are these things and why are they so important for deep learning?
53 00:05:04,516 --> 00:05:08,937 Well, the GPU is called a graphics card, or Graphics Processing Unit. 54 00:05:08,937 --> 00:05:16,166 And these were really developed originally for rendering computer graphics, and especially around games and that sort of thing. 55 00:05:16,166 --> 00:05:23,247 So another show of hands, who plays video games at home sometimes, from time to time on their computer? 56 00:05:23,247 --> 00:05:25,693 Yeah, so again, maybe about half, good fraction. 57 00:05:25,693 --> 00:05:32,196 So for those of you who've played video games before and who've built your own computers, you probably have your own opinions on this debate. 58 00:05:32,196 --> 00:05:34,095 [laughs] 59 00:05:34,095 --> 00:05:37,666 So this is one of those big debates in computer science. 60 00:05:37,666 --> 00:05:42,620 You know, there's like Intel versus AMD for CPUs, NVIDIA versus AMD for graphics cards. 61 00:05:42,620 --> 00:05:45,394 It's up there with Vim versus Emacs for text editors. 62 00:05:45,394 --> 00:05:51,945 And pretty much any gamer has their own opinions on which of these two sides they prefer for their own cards. 63 00:05:51,945 --> 00:05:59,116 And in deep learning we kind of have mostly picked one side of this fight, and that's NVIDIA. 64 00:05:59,116 --> 00:06:05,117 So if you guys have AMD cards, you might be in a little bit more trouble if you want to use those for deep learning. 65 00:06:05,117 --> 00:06:08,812 And really, NVIDIA's been pushing a lot for deep learning in the last several years. 66 00:06:08,812 --> 00:06:11,997 It's been kind of a large focus of some of their strategy. 67 00:06:11,997 --> 00:06:19,354 And they put in a lot of effort into engineering sort of good solutions to make their hardware better suited for deep learning. 68 00:06:19,354 --> 00:06:27,718 So most people in deep learning when we talk about GPUs, we're pretty much exclusively talking about NVIDIA GPUs. 69 00:06:27,718 --> 00:06:35,268 Maybe in the future this'll change a little bit, and there might be new players coming up, but at least for now NVIDIA is pretty dominant. 70 00:06:35,268 --> 00:06:41,705 So to give you an idea of like what is the difference between a CPU and a GPU, I've kind of made a little spreadsheet here. 71 00:06:41,705 --> 00:06:52,079 On the top we have two of the kind of top end Intel consumer CPUs, and on the bottom we have two of NVIDIA's sort of current top end consumer GPUs. 72 00:06:52,079 --> 00:06:55,975 And there's a couple general trends to notice here. 73 00:06:55,975 --> 00:07:03,284 Both GPUs and CPUs are kind of general purpose computing machines where they can execute programs and do sort of arbitrary instructions, 74 00:07:03,284 --> 00:07:05,987 but they're qualitatively pretty different. 75 00:07:05,987 --> 00:07:16,714 So CPUs tend to have just a few cores, for consumer desktop CPUs these days, they might have something like four or six or maybe up to 10 cores. 76 00:07:16,714 --> 00:07:24,893 With hyperthreading technology that means they can run, the hardware can physically run, like maybe eight or up to 20 threads concurrently. 77 00:07:24,893 --> 00:07:29,700 So the CPU can maybe do 20 things in parallel at once. 78 00:07:29,700 --> 00:07:34,527 So that's just not a gigantic number, but those threads for a CPU are pretty powerful. 79 00:07:34,527 --> 00:07:37,223 They can actually do a lot of things, they're very fast. 80 00:07:37,223 --> 00:07:43,011 Every CPU instruction can actually do quite a lot of stuff. And they can all work pretty independently.
81 00:07:43,011 --> 00:07:51,909 For GPUs it's a little bit different. So for GPUs we see that these sort of common top end consumer GPUs have thousands of cores. 82 00:07:51,909 --> 00:08:00,412 So the NVIDIA Titan XP which is the current top of the line consumer GPU has 3840 cores. So that's a crazy number. 83 00:08:02,223 --> 00:08:06,357 That's like way more than the 10 cores that you'll get for a similarly priced CPU. 84 00:08:06,357 --> 00:08:12,207 The downside of a GPU is that each of those cores, one, runs at a much slower clock speed. 85 00:08:12,207 --> 00:08:14,439 And two, they really can't do quite as much. 86 00:08:14,439 --> 00:08:19,680 You can't really compare CPU cores and GPU cores apples to apples. 87 00:08:19,680 --> 00:08:22,510 The GPU cores can't really operate very independently. 88 00:08:22,510 --> 00:08:29,297 They all kind of need to work together and sort of parallelize one task across many cores rather than each core totally doing its own thing. 89 00:08:29,297 --> 00:08:32,405 So you can't really compare these numbers directly. 90 00:08:32,405 --> 00:08:41,370 But it should give you the sense that due to the large number of cores, GPUs are really good for parallel things where you need to do a lot of things all at the same time, 91 00:08:41,370 --> 00:08:44,742 but those things are all pretty much the same flavor. 92 00:08:44,742 --> 00:08:49,387 Another thing to point out between CPUs and GPUs is this idea of memory. 93 00:08:49,387 --> 00:08:58,523 Right, so CPUs have some cache on the CPU, but that's relatively small and the majority of the memory for your CPU is pulled from your 94 00:08:58,523 --> 00:09:06,589 system memory, the RAM, which will maybe be like eight, 12, 16, 32 gigabytes of RAM on a typical consumer desktop these days. 95 00:09:06,589 --> 00:09:10,646 Whereas GPUs actually have their own RAM built into the chip. 96 00:09:12,055 --> 00:09:22,675 There's a pretty large bottleneck communicating between the RAM in your system and the GPU, so the GPUs typically have their own relatively large block of memory within the card itself. 97 00:09:23,955 --> 00:09:33,481 And for the Titan XP, which again is maybe the current top of the line consumer card, this thing has 12 gigabytes of memory local to the GPU. 98 00:09:33,481 --> 00:09:41,790 GPUs also have their own caching system where there are sort of multiple hierarchies of caching between the 12 gigabytes of GPU memory and the actual GPU cores. 99 00:09:41,790 --> 00:09:46,908 And that's somewhat similar to the caching hierarchy that you might see in a CPU. 100 00:09:47,985 --> 00:09:52,583 So, CPUs are kind of good for general purpose processing. They can do a lot of different things. 101 00:09:52,583 --> 00:09:57,089 And GPUs are maybe more specialized for these highly parallelizable algorithms. 102 00:09:57,089 --> 00:10:04,106 So the prototypical algorithm of something that works really really well and is like perfectly suited to a GPU is matrix multiplication. 103 00:10:04,106 --> 00:10:14,348 So remember in matrix multiplication on the left we've got like a matrix composed of a bunch of rows. We multiply that on the right by another matrix composed of a bunch of columns and then this produces 104 00:10:14,348 --> 00:10:25,009 another, a final matrix where each element in the output matrix is a dot product between one of the rows and one of the columns of the two input matrices. And these dot products are all independent.
105 00:10:25,009 --> 00:10:33,653 Like you could imagine, for this output matrix you could split it up completely and have each of those different elements of the output matrix all being computed in parallel 106 00:10:33,653 --> 00:10:38,289 and they all sort of are running the same computation which is taking a dot product of these two vectors. 107 00:10:38,289 --> 00:10:44,177 But exactly where they're reading that data from is from different places in the two input matrices. 108 00:10:44,177 --> 00:10:55,166 So you could imagine that for a GPU you can just like blast this out and have all of these elements of the output matrix all computed in parallel, and that could make this thing compute super fast on a GPU. 109 00:10:55,166 --> 00:11:04,940 So that's kind of the prototypical type of problem where a GPU is really well suited, where a CPU might have to go in and step through sequentially and compute each of these elements one by one. 110 00:11:06,337 --> 00:11:13,829 That picture is a little bit of a caricature because CPUs these days have multiple cores, they can do vectorized instructions as well, 111 00:11:13,829 --> 00:11:19,568 but still, for these like massively parallel problems GPUs tend to have much better throughput. 112 00:11:19,568 --> 00:11:25,404 Especially when these matrices get really really big. And by the way, convolution is kind of the same kind of story. 113 00:11:25,404 --> 00:11:36,359 Where you know in convolution we have this input tensor, we have this weight tensor and then every point in the output tensor after a convolution is again some inner product between some part of the weights and some part of the input. 114 00:11:36,359 --> 00:11:43,354 And you can imagine that a GPU could really parallelize this computation, split it all up across the many cores and compute it very quickly. 115 00:11:43,354 --> 00:11:49,510 So that's kind of the general flavor of the types of problems where GPUs give you a huge speed advantage over CPUs. 116 00:11:51,695 --> 00:11:55,498 So you can actually write programs that run directly on GPUs. 117 00:11:55,498 --> 00:12:03,614 So NVIDIA has this CUDA abstraction that lets you write code that kind of looks like C, but executes directly on the GPUs. 118 00:12:03,614 --> 00:12:05,484 But CUDA code is really really tricky. 119 00:12:05,484 --> 00:12:12,002 It's actually really tough to write CUDA code that's performant and actually squeezes all the juice out of these GPUs. 120 00:12:12,002 --> 00:12:19,163 You have to be very careful managing the memory hierarchy and making sure you don't have cache misses and branch mispredictions and all that sort of stuff. 121 00:12:19,163 --> 00:12:22,930 So it's actually really really hard to write performant CUDA code on your own. 122 00:12:22,930 --> 00:12:32,537 So as a result NVIDIA has released a lot of libraries that implement common computational primitives that are very very highly optimized for GPUs. 123 00:12:32,537 --> 00:12:40,610 So for example NVIDIA has a cuBLAS library that implements different kinds of matrix multiplications and different matrix operations that are super optimized, 124 00:12:40,610 --> 00:12:46,438 run really well on GPU, get very close to sort of theoretical peak hardware utilization.
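(In practice you usually hit those optimized routines through a framework rather than calling them directly; as a rough sketch, assuming you have PyTorch with CUDA available, the same matrix multiply on CUDA tensors will dispatch to NVIDIA's optimized GPU kernels under the hood.)

```python
import torch

# Rough sketch: the same matrix multiply on CPU vs. GPU.
# On a CUDA device this call ends up dispatching to NVIDIA's optimized
# GPU libraries (cuBLAS) under the hood; you never call cuBLAS yourself.
A = torch.randn(4096, 4096)
B = torch.randn(4096, 4096)

C_cpu = A.mm(B)                   # runs on the CPU

if torch.cuda.is_available():     # only if a CUDA GPU is present
    A_gpu, B_gpu = A.cuda(), B.cuda()
    C_gpu = A_gpu.mm(B_gpu)       # runs on the GPU
```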
125 00:12:46,438 --> 00:12:54,499 Similarly they have a cuDNN library which implements things like convolution, forward and backward passes, batch normalization, recurrent networks, 126 00:12:54,499 --> 00:12:57,454 all these kinds of computational primitives that we need in deep learning. 127 00:12:57,454 --> 00:13:03,842 NVIDIA has gone in there and released their own binaries that compute these primitives very efficiently on NVIDIA hardware. 128 00:13:03,842 --> 00:13:09,624 So in practice, you tend not to end up writing your own CUDA code for deep learning. 129 00:13:09,624 --> 00:13:14,173 You typically are just mostly calling into existing code that other people have written. 130 00:13:14,173 --> 00:13:19,573 Much of which is the stuff which has been heavily optimized by NVIDIA already. 131 00:13:19,573 --> 00:13:23,693 There's another sort of language called OpenCL which is a bit more general. 132 00:13:23,693 --> 00:13:29,185 Runs on more than just NVIDIA GPUs, can run on AMD hardware, can run on CPUs, 133 00:13:29,185 --> 00:13:43,938 but with OpenCL, nobody's really spent a large amount of effort and energy trying to get optimized deep learning primitives, so it tends to be a lot less performant than the super optimized versions in CUDA. 134 00:13:43,938 --> 00:13:51,839 So maybe in the future we might see a more open standard and we might see this across many more types of platforms, but at least for now, 135 00:13:51,839 --> 00:13:55,488 NVIDIA's kind of the main game in town for deep learning. 136 00:13:55,488 --> 00:14:01,686 So you can check, there's a lot of different resources for learning about how you can do GPU programming yourself. It's kind of fun. 137 00:14:01,686 --> 00:14:05,900 It's sort of a different paradigm of writing code because it's this massively parallel architecture, 138 00:14:05,900 --> 00:14:08,023 but that's a bit beyond the scope of this course. 139 00:14:08,023 --> 00:14:12,263 And again, you don't really need to write your own CUDA code much in practice for deep learning. 140 00:14:12,263 --> 00:14:16,600 And in fact, I've never written my own CUDA code for any research project, so, 141 00:14:16,600 --> 00:14:22,219 but it is kind of useful to know like how it works and what are the basic ideas even if you're not writing it yourself. 142 00:14:23,488 --> 00:14:29,168 So if you want to look at kind of CPU GPU performance in practice, I did some benchmarks last summer 143 00:14:29,168 --> 00:14:36,065 comparing a decent Intel CPU against a bunch of different GPUs that were sort of near top of the line at that time. 144 00:14:38,747 --> 00:14:48,954 And these were my own benchmarks that you can find more details about on GitHub, but my findings were that for things like VGG 16 and 19, ResNets, various ResNets, 145 00:14:49,830 --> 00:14:57,114 then you typically see something like a 65 to 75 times speed up when running the exact same computation 146 00:14:57,114 --> 00:15:00,984 on a top of the line GPU, in this case a Pascal Titan X, 147 00:15:00,984 --> 00:15:08,604 versus a top of the line, well, not quite top of the line CPU, which in this case was an Intel E5 processor. 148 00:15:08,604 --> 00:15:15,550 Although, one sort of caveat here is that you always need to be super careful whenever you're reading any kind of benchmarks 149 00:15:15,550 --> 00:15:20,103 about deep learning, because it's super easy to be unfair between different things.
150 00:15:20,103 --> 00:15:26,339 And you kind of need to know a lot of the details about what exactly is being benchmarked in order to know whether or not the comparison is fair. 151 00:15:26,339 --> 00:15:35,855 So in this case I'll come right out and tell you that probably this comparison is a little bit unfair to CPU because I didn't spend a lot of effort 152 00:15:35,855 --> 00:15:38,721 trying to squeeze the maximal performance out of CPUs. 153 00:15:38,721 --> 00:15:42,483 I probably could have tuned the BLAS libraries better for the CPU performance. 154 00:15:42,483 --> 00:15:44,540 And I probably could have gotten these numbers a bit better. 155 00:15:44,540 --> 00:15:51,964 This was sort of out of the box performance between just installing Torch, running it on a CPU, just installing Torch running it on a GPU. 156 00:15:51,964 --> 00:15:57,872 So this is kind of out of the box performance, but it's not really like peak, possible, theoretical throughput on the CPU. 157 00:15:57,872 --> 00:16:02,422 But that being said, I think there are still pretty substantial speed ups to be had here. 158 00:16:02,422 --> 00:16:15,543 Another kind of interesting outcome from this benchmarking was comparing these optimized cuDNN libraries from NVIDIA for convolution and whatnot versus sort of more naive CUDA that had been hand written 159 00:16:15,543 --> 00:16:17,623 out in the open source community. 160 00:16:17,623 --> 00:16:24,653 And you can see that if you compare the same networks on the same hardware with the same deep learning framework and the only difference is swapping out 161 00:16:24,653 --> 00:16:37,442 these cuDNN versus sort of hand written, less optimized CUDA you can see something like nearly a three X speed up across the board when you switch from the relatively simple CUDA to these like super optimized cuDNN implementations. 162 00:16:37,442 --> 00:16:45,202 So in general, whenever you're writing code on GPU, you should probably almost always like just make sure you're using cuDNN because you're leaving probably 163 00:16:45,202 --> 00:16:51,602 a three X performance boost on the table if you're not calling into cuDNN for your stuff. 164 00:16:51,602 --> 00:17:02,882 So another problem that comes up in practice, when you're training these things is that you know, your model is maybe sitting on the GPU, the weights of the model are in that 12 gigabytes of local storage on the GPU, but your big dataset 165 00:17:02,882 --> 00:17:07,243 is sitting over on the right on a hard drive or an SSD or something like that. 166 00:17:07,243 --> 00:17:13,204 So if you're not careful you can actually bottleneck your training by just trying to read the data off the disk. 167 00:17:14,321 --> 00:17:23,002 'Cause the GPU is super fast, it can compute forward and backward quite fast, but if you're reading sequentially off a spinning disk, you can actually bottleneck your training quite a bit, 168 00:17:23,002 --> 00:17:25,699 and that can be really bad and slow you down. 169 00:17:25,700 --> 00:17:31,459 So some solutions here are that like you know if your dataset's really small, sometimes you might just read the whole dataset into RAM. 170 00:17:31,459 --> 00:17:36,479 Or even if your dataset isn't so small, but you have a giant server with a ton of RAM, you might do that anyway. 171 00:17:36,479 --> 00:17:42,917 You can also make sure you're using an SSD instead of a hard drive, that can help a lot with read throughput.
172 00:17:42,917 --> 00:17:52,152 Another common strategy is to use multiple threads on the CPU that are pre-fetching data off RAM or off disk, buffering it in memory, in RAM so that 173 00:17:52,152 --> 00:17:57,724 then you can continue feeding that buffered data down to the GPU with good performance. 174 00:17:57,724 --> 00:18:08,804 This is a little bit painful to set up, but again like, these GPUs are so fast that if you're not really careful with trying to feed them data as quickly as possible, just reading the data can sometimes bottleneck the whole training process. 175 00:18:08,804 --> 00:18:11,657 So that's something to be aware of. 176 00:18:11,657 --> 00:18:17,432 So that's kind of the brief introduction to like sort of GPU and CPU hardware in practice when it comes to deep learning. 177 00:18:17,432 --> 00:18:21,616 And then I wanted to switch gears a little bit and talk about the software side of things. 178 00:18:21,616 --> 00:18:25,006 The various deep learning frameworks that people are using in practice. 179 00:18:25,006 --> 00:18:28,819 But I guess before I move on, are there any sort of questions about CPUs and GPUs? 180 00:18:28,819 --> 00:18:30,519 Yeah, question? 181 00:18:30,519 --> 00:18:34,686 [student's words obscured due to lack of microphone] 182 00:18:40,961 --> 00:18:45,854 Yeah, so the question is what can you sort of, what can you do mechanically when you're coding to avoid these problems? 183 00:18:45,854 --> 00:18:50,833 Probably the biggest thing you can do in software is set up sort of pre-fetching on the CPU. 184 00:18:50,833 --> 00:18:55,054 Like, sort of a naive thing would be you have this sequential process where you 185 00:18:55,054 --> 00:18:58,791 first read data off disk, wait for the data, wait for the minibatch to be read, 186 00:18:58,791 --> 00:19:02,458 then feed the minibatch to the GPU, then go forward and backward on the GPU, 187 00:19:02,458 --> 00:19:05,442 then read another minibatch and sort of do this all in sequence. 188 00:19:06,714 --> 00:19:15,469 Instead you might have multiple CPU threads running in the background that are fetching data off the disk, such that 189 00:19:15,469 --> 00:19:17,076 you can sort of interleave all of these things. 190 00:19:17,076 --> 00:19:21,506 Like the GPU is computing, the CPU background threads are feeding data off disk 191 00:19:21,506 --> 00:19:28,534 and your main thread is just doing a bit of synchronization between these things so they're all happening in parallel. 192 00:19:28,534 --> 00:19:38,016 And thankfully if you're using some of these deep learning frameworks that we're about to talk about, then some of this work has already been done for you 'cause it's a little bit painful. 193 00:19:38,016 --> 00:19:41,738 So the landscape of deep learning frameworks is super fast moving. 194 00:19:41,738 --> 00:19:47,915 So last year when I gave this lecture I talked mostly about Caffe, Torch, Theano and TensorFlow. 195 00:19:47,915 --> 00:20:00,232 And when I last gave this talk, again more than a year ago, TensorFlow was relatively new. It had not seen super widespread adoption yet at that time. But now I think in the last year TensorFlow 196 00:20:00,232 --> 00:20:06,310 has gotten much more popular. It's probably the main framework of choice for many people. So that's a big change.
197 00:20:07,342 --> 00:20:12,282 We've also seen a ton of new frameworks sort of popping up like mushrooms in the last year. 198 00:20:12,282 --> 00:20:18,052 So in particular Caffe2 and PyTorch are new frameworks from Facebook that I think are pretty interesting. 199 00:20:18,052 --> 00:20:20,409 There's also a ton of other frameworks. 200 00:20:20,409 --> 00:20:24,089 Baidu has Paddle, Microsoft has CNTK, 201 00:20:24,089 --> 00:20:33,449 Amazon is mostly using MXNet and there's a ton of other frameworks as well that I'm less familiar with and really don't have time to get into. 202 00:20:33,449 --> 00:20:43,572 But one interesting thing to point out from this picture is that kind of the first generation of deep learning frameworks that really saw wide adoption were built in academia. 203 00:20:43,572 --> 00:20:49,388 So Caffe was from Berkeley, Torch was developed originally at NYU and also in collaboration with Facebook. 204 00:20:49,388 --> 00:20:52,077 And Theano was mostly built at the University of Montreal. 205 00:20:52,077 --> 00:20:56,491 But these kind of next generation deep learning frameworks all originated in industry. 206 00:20:56,491 --> 00:21:00,659 So Caffe2 is from Facebook, PyTorch is from Facebook. TensorFlow is from Google. 207 00:21:00,659 --> 00:21:08,925 So one kind of interesting shift that we've seen in the landscape over the last couple of years is that these ideas have really moved a lot from academia into industry. 208 00:21:08,925 --> 00:21:13,187 And now industry is kind of giving us these big powerful nice frameworks to work with. 209 00:21:14,147 --> 00:21:24,850 So today I wanted to mostly talk about PyTorch and TensorFlow 'cause I personally think that those are probably the ones you should be focusing on for a lot of research type problems these days. 210 00:21:24,850 --> 00:21:32,192 I'll also talk a bit about Caffe and Caffe2. But probably a little bit less emphasis on those. 211 00:21:32,192 --> 00:21:36,705 And before we move any farther, I thought I should make my own biases a little bit more explicit. 212 00:21:36,705 --> 00:21:43,501 So I've worked with Torch mostly for the last several years. And I've used it quite a lot, I like it a lot. 213 00:21:43,501 --> 00:21:48,568 And then in the last year I've mostly switched to PyTorch as my main research framework. 214 00:21:48,568 --> 00:21:52,306 So I have a little bit less experience with some of these others, especially TensorFlow, 215 00:21:52,306 --> 00:21:58,382 but I'll still try to do my best to give you a fair picture and a decent overview of these things. 216 00:21:58,382 --> 00:22:06,807 So, remember that in the last several lectures we've hammered in this idea of computational graphs sort of over and over. 217 00:22:06,807 --> 00:22:13,176 That whenever you're doing deep learning, you want to think about building some computational graph that computes whatever function that you want to compute. 218 00:22:13,176 --> 00:22:18,778 So in the case of a linear classifier you'll combine your data X and your weights W with a matrix multiply. 219 00:22:18,778 --> 00:22:22,832 You'll do some kind of hinge loss to maybe compute your loss. 220 00:22:22,832 --> 00:22:28,909 You'll have some regularization term and you imagine stitching together all these different operations into some graph structure.
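(As a tiny, hedged sketch of that linear-classifier graph written out in Numpy, with made-up shapes; each line roughly corresponds to one node in the graph.)

```python
import numpy as np

# Sketch of the linear-classifier computational graph (illustrative shapes).
N, D, C, reg = 64, 3072, 10, 1e-3
X = np.random.randn(N, D)                       # data
y = np.random.randint(C, size=N)                # labels
W = 0.01 * np.random.randn(D, C)                # weights

scores = X.dot(W)                               # node: matrix multiply
correct = scores[np.arange(N), y][:, None]
margins = np.maximum(0, scores - correct + 1.0) # node: multiclass hinge loss
margins[np.arange(N), y] = 0
data_loss = margins.sum() / N
loss = data_loss + reg * np.sum(W * W)          # node: add regularization term
```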
221 00:22:28,909 --> 00:22:36,167 Remember that these graph structures can get pretty complex in the case of a big neural net, now there's many different layers, many different activations. 222 00:22:36,167 --> 00:22:39,687 Many different weights spread all around in a pretty complex graph. 223 00:22:39,687 --> 00:22:47,328 And as you move to things like Neural Turing Machines then you can get these really crazy computational graphs that you can't even really draw because they're so big and messy. 224 00:22:48,349 --> 00:22:58,727 So the point of deep learning frameworks is really, there's really kind of three main reasons why you might want to use one of these deep learning frameworks rather than just writing your own code. 225 00:22:58,727 --> 00:23:08,610 So the first would be that these frameworks enable you to easily build and work with these big hairy computational graphs without kind of worrying about a lot of those bookkeeping details yourself. 226 00:23:08,610 --> 00:23:13,956 Another major idea is that, whenever we're working in deep learning we always need to compute gradients. 227 00:23:14,812 --> 00:23:18,900 We're always computing some loss, we're always computing the gradient of the loss with respect to our weights. 228 00:23:18,900 --> 00:23:26,115 And we'd like the framework to compute those gradients automatically; you don't want to have to write that code yourself. 229 00:23:26,115 --> 00:23:36,539 You want that framework to handle all these back propagation details for you so you can just think about writing down the forward pass of your network and have the backward pass sort of come out for free without any additional work. 230 00:23:36,539 --> 00:23:42,000 And finally you want all this stuff to run efficiently on GPUs so you don't have to worry too much about these 231 00:23:42,000 --> 00:23:48,389 low level hardware details about cuBLAS and cuDNN and CUDA and moving data between the CPU and GPU memory. 232 00:23:48,389 --> 00:23:52,439 You kind of want all those messy details to be taken care of for you. 233 00:23:52,439 --> 00:23:59,450 So those are kind of some of the major reasons why you might choose to use frameworks rather than writing your own stuff from scratch. 234 00:23:59,450 --> 00:24:05,231 So as kind of a concrete example of a computational graph we can maybe write down this super simple thing. 235 00:24:05,231 --> 00:24:13,071 Where we have three inputs, X, Y, and Z. We're going to combine X and Y to produce A. Then we're going to combine A and Z to produce B 236 00:24:13,071 --> 00:24:18,630 and then finally we're going to do some maybe summing out operation on B to give some scalar final result C. 237 00:24:18,630 --> 00:24:31,631 So you've probably written enough Numpy code at this point to realize that it's super easy to write down, to implement this computational graph, or rather to implement this bit of computation in Numpy, right? 238 00:24:31,631 --> 00:24:41,923 You can just kind of write down in Numpy that you want to generate some random data, you want to multiply two things, you want to add two things, you want to sum out a couple things. And it's really easy to do this in Numpy. 239 00:24:41,923 --> 00:24:48,355 But then the question is like suppose that we want to compute the gradient of C with respect to X, Y, and Z. 240 00:24:48,355 --> 00:24:52,725 So, if you're working in Numpy, you kind of need to write out this backward pass yourself.
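(A minimal sketch of what that looks like in Numpy, assuming the graph just described is a = x * y, b = a + z, c = sum(b); the forward pass is easy, but the backward pass is on you.)

```python
import numpy as np

np.random.seed(0)
x, y, z = np.random.randn(3, 4), np.random.randn(3, 4), np.random.randn(3, 4)

# Forward pass: easy to write in Numpy.
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: you have to derive and write this yourself.
grad_c = 1.0
grad_b = grad_c * np.ones_like(b)   # d(sum)/db is 1 everywhere
grad_a = grad_b.copy()              # b = a + z
grad_z = grad_b.copy()
grad_x = grad_a * y                 # a = x * y
grad_y = grad_a * x
```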
241 00:24:52,725 --> 00:25:02,859 And you've gotten a lot of practice with this on the homeworks, but it can be kind of a pain and a little bit annoying and messy once you get to really big complicated things. 242 00:25:02,859 --> 00:25:05,675 The other problem with Numpy is that it doesn't run on the GPU. 243 00:25:05,675 --> 00:25:14,920 So Numpy is definitely CPU only. And you're never going to be able to experience or take advantage of these GPU accelerated speedups if you're stuck working in Numpy. 244 00:25:14,920 --> 00:25:19,527 And it's, again, it's a pain to have to compute your own gradients in all these situations. 245 00:25:19,527 --> 00:25:29,047 So, kind of the goal of most deep learning frameworks these days is to let you write code in the forward pass that looks very similar to Numpy, 246 00:25:29,047 --> 00:25:33,069 but lets you run it on the GPU and lets you automatically compute gradients. 247 00:25:33,069 --> 00:25:36,397 And that's kind of the big picture goal of most of these frameworks. 248 00:25:36,397 --> 00:25:44,314 So if you imagine looking at, if we look at an example in TensorFlow of the exact same computational graph, we now see that in this forward pass, 249 00:25:44,314 --> 00:25:52,687 you write this code that ends up looking very very similar to the Numpy forward pass where you're kind of doing these multiplication and these addition operations. 250 00:25:52,687 --> 00:25:57,623 But now TensorFlow has this magic line that just computes all the gradients for you. 251 00:25:57,623 --> 00:26:02,235 So now you don't have to go in and write your own backward pass and that's much more convenient. 252 00:26:02,235 --> 00:26:08,926 The other nice thing about TensorFlow is you can really just, like with one line you can switch all this computation between CPU and GPU. 253 00:26:08,926 --> 00:26:16,668 So here, if you just add this with statement before you're doing this forward pass, you just can explicitly tell the framework, hey I want to run this code on the CPU. 254 00:26:16,668 --> 00:26:24,866 But now if we just change that with statement a little bit, with just a one character change in this case, changing that C to a G, now the code runs on GPU. 255 00:26:24,866 --> 00:26:31,388 And now in this little code snippet, we've solved these two problems. We're running our code on the GPU 256 00:26:31,388 --> 00:26:35,685 and we're having the framework compute all the gradients for us, so that's really nice. 257 00:26:35,685 --> 00:26:38,459 And PyTorch kind of looks almost exactly the same. 258 00:26:38,459 --> 00:26:42,509 So again, in PyTorch you kind of write down, you define some variables, 259 00:26:42,509 --> 00:26:49,262 you have some forward pass and the forward pass again looks very similar to like, in this case identical to the Numpy code. 260 00:26:49,262 --> 00:26:56,251 And then again, you can just use PyTorch to compute gradients, all your gradients with just one line. 261 00:26:56,251 --> 00:27:06,781 And now in PyTorch again, it's really easy to switch to GPU, you just need to cast all your stuff to the CUDA data type before you run your computation and now everything runs transparently on the GPU for you.
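(A rough PyTorch version of the same little graph, sketched against the older Variable-style API this lecture is describing; the CUDA cast mentioned at the end is only needed if you actually have a GPU.)

```python
import torch
from torch.autograd import Variable

N, D = 3, 4
# requires_grad=True tells PyTorch to build the graph for these inputs
x = Variable(torch.randn(N, D), requires_grad=True)
y = Variable(torch.randn(N, D), requires_grad=True)
z = Variable(torch.randn(N, D), requires_grad=True)

# Forward pass looks just like the Numpy version.
a = x * y
b = a + z
c = torch.sum(b)

c.backward()                        # one line to compute all the gradients
print(x.grad, y.grad, z.grad)

# To run on GPU instead, you would cast the data to the CUDA type first,
# e.g. x = Variable(torch.randn(N, D).cuda(), requires_grad=True), and so on.
```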
262 00:27:06,781 --> 00:27:13,878 So if you kind of just look at these three examples, these three snippets of code side by side, the Numpy, the TensorFlow and the PyTorch 263 00:27:13,878 --> 00:27:20,564 you see that the TensorFlow and the PyTorch code in the forward pass looks almost exactly like Numpy 264 00:27:20,564 --> 00:27:24,349 which is great 'cause Numpy has a beautiful API, it's really easy to work with. 265 00:27:24,349 --> 00:27:29,192 But we can compute gradients automatically and we can run on the GPU automatically. 266 00:27:30,186 --> 00:27:37,502 So after that kind of introduction, I wanted to dive in and talk in a little bit more detail about kind of what's going on inside this TensorFlow example. 267 00:27:37,502 --> 00:27:50,662 So as a running example throughout the rest of the lecture, I'm going to use training a two-layer fully connected ReLU network on random data. 268 00:27:50,662 --> 00:27:55,289 And we're going to train this thing with an L2 Euclidean loss on random data. 269 00:27:55,289 --> 00:28:08,966 So this is kind of a silly network, it's not really doing anything useful, but it is relatively small, self contained, the code fits on the slide without being too small, and it lets you demonstrate kind of a lot of the useful ideas inside these frameworks. 270 00:28:08,966 --> 00:28:15,900 So here on the right, oh, and then another note, I'm kind of assuming that Numpy and TensorFlow have already been imported in all these code snippets. 271 00:28:15,900 --> 00:28:21,163 So in TensorFlow you would typically divide your computation into two major stages. 272 00:28:21,163 --> 00:28:28,363 First, we're going to write some code that defines our computational graph, and that's this red code up in the top half. 273 00:28:28,363 --> 00:28:32,360 And then after you define your graph, you're going to run the graph over and over again 274 00:28:32,360 --> 00:28:36,851 and actually feed data into the graph to perform whatever computation you want it to perform. 275 00:28:36,851 --> 00:28:40,961 So this is the really, this is kind of the big common pattern in TensorFlow. 276 00:28:40,961 --> 00:28:46,615 You'll first have a bunch of code that builds the graph and then you'll go and run the graph and reuse it many many times. 277 00:28:48,099 --> 00:28:52,763 So if you kind of dive into the code of building the graph in this case. 278 00:28:52,763 --> 00:29:00,709 Up at the top you see that we're defining this X, Y, w1 and w2, and we're creating these tf.placeholder objects. 279 00:29:01,637 --> 00:29:05,193 So these are going to be input nodes to the graph. 280 00:29:05,193 --> 00:29:15,379 These are going to be sort of entry points to the graph where when we run the graph, we're going to feed in data and put them in through these input slots in our computational graph. 281 00:29:15,379 --> 00:29:21,944 So this is not actually like allocating any memory right now. We're just sort of setting up these input slots to the graph. 282 00:29:23,272 --> 00:29:28,665 Then we're going to use those input slots which are now kind of like these symbolic variables 283 00:29:28,665 --> 00:29:37,135 and we're going to perform different TensorFlow operations on these symbolic variables in order to set up what computation we want to run on those variables.
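(Roughly, the graph-definition half being described might look like the sketch below, assuming the TensorFlow 1.x API; N, D, H are illustrative sizes, not the exact values on the slide.)

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100

# Define the graph: placeholders are input slots, nothing is computed yet.
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.placeholder(tf.float32, shape=(D, H))
w2 = tf.placeholder(tf.float32, shape=(H, D))

h = tf.maximum(tf.matmul(x, w1), 0)                       # hidden layer with ReLU
y_pred = tf.matmul(h, w2)                                 # output predictions
diff = y_pred - y
loss = tf.reduce_mean(tf.reduce_sum(diff ** 2, axis=1))   # L2 / Euclidean loss

# Ask TensorFlow to add gradient nodes to the graph.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])
```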
284 00:29:37,135 --> 00:29:46,109 So in this case we're doing a matrix multiplication between X and w1, we're doing some tf.maximum to do a ReLU nonlinearity and then we're doing another 285 00:29:46,109 --> 00:29:49,240 matrix multiplication to compute our output predictions. 286 00:29:49,240 --> 00:29:58,175 And then we're again using sort of basic tensor operations to compute our Euclidean distance, our L2 loss between our prediction and the target Y. 287 00:29:58,175 --> 00:30:05,824 Another thing to point out here is that these lines of code are not actually computing anything. There's no data in the system right now. 288 00:30:05,824 --> 00:30:15,001 We're just building up this computational graph data structure telling TensorFlow which operations we want to eventually run once we put in real data. 289 00:30:15,001 --> 00:30:18,648 So this is just building the graph, this is not actually doing anything. 290 00:30:18,648 --> 00:30:33,135 Then we have this magical line where after we've computed our loss with these symbolic operations, then we can just ask TensorFlow to compute the gradient of the loss with respect to w1 and w2 in this one magical, beautiful line. 291 00:30:33,135 --> 00:30:37,981 And this avoids you having to write all your own backprop code like you had to do in the assignments. 292 00:30:37,981 --> 00:30:40,439 But again there's no actual computation happening here. 293 00:30:40,439 --> 00:30:51,108 This is just sort of adding extra operations to the computational graph where now the computational graph has these additional operations which will end up computing these gradients for you. 294 00:30:51,108 --> 00:31:01,421 So now at this point we've built our computational graph, we have this big graph in this graph data structure in memory that knows what operations we want to perform to compute the loss and gradients. 295 00:31:01,421 --> 00:31:06,843 And now we enter a TensorFlow session to actually run this graph and feed it with data. 296 00:31:06,843 --> 00:31:13,859 So then, once we've entered the session, then we actually need to construct some concrete values that will be fed to the graph. 297 00:31:13,859 --> 00:31:19,459 So TensorFlow just expects to receive data from Numpy arrays in most cases. 298 00:31:19,459 --> 00:31:30,226 So here we're just creating concrete actual values for X, Y, w1 and w2 using Numpy and then storing these in some dictionary. 299 00:31:30,226 --> 00:31:32,743 And now here is where we're actually running the graph. 300 00:31:32,743 --> 00:31:38,120 So you can see that we're calling a session.run to actually execute some part of the graph. 301 00:31:38,120 --> 00:31:43,899 The first argument tells it which parts of the graph we actually want as output. 302 00:31:43,899 --> 00:31:50,950 So in this case we need to tell it that we actually want to compute loss and grad w1 and grad w2 303 00:31:50,950 --> 00:31:57,140 and we need to pass in with this feed dict parameter the actual concrete values that will be fed to the graph. 304 00:31:57,140 --> 00:32:06,541 And then after, in this one line, it's going and running the graph and then computing those values for loss, grad w1, and grad w2 305 00:32:06,541 --> 00:32:12,003 and then returning the actual concrete values for those in Numpy arrays again. 306 00:32:12,003 --> 00:32:19,859 So now after you unpack this output in the second line, you get Numpy arrays with the loss and the gradients.
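(Continuing the sketch from above, assuming the same placeholder definitions, the run-the-graph half might look roughly like this.)

```python
# Continuing the sketch: now run the graph with concrete Numpy data.
with tf.Session() as sess:
    values = {
        x: np.random.randn(N, D),
        y: np.random.randn(N, D),
        w1: np.random.randn(D, H),
        w2: np.random.randn(H, D),
    }
    # Ask for the loss and both gradients; feed_dict supplies the placeholders.
    out = sess.run([loss, grad_w1, grad_w2], feed_dict=values)
    loss_val, grad_w1_val, grad_w2_val = out   # plain Numpy arrays come back
```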
307 00:32:19,859 --> 00:32:23,697 So then you can go and do whatever you want with these values. 308 00:32:23,697 --> 00:32:29,599 So then, this has only run sort of one forward and backward pass through our graph, 309 00:32:29,599 --> 00:32:33,167 and it only takes a couple extra lines if we actually want to train the network. 310 00:32:33,167 --> 00:32:45,511 So here, now we're running the graph many times in a loop, so we're doing a for loop and in each iteration of the loop, we're calling session.run asking it to compute the loss and the gradients. 311 00:32:45,511 --> 00:32:52,291 And now we're doing a manual gradient descent step using those computed gradients to now update our current values of the weights. 312 00:32:52,291 --> 00:33:00,749 So if you actually run this code and plot the losses, then you'll see that the loss goes down and the network is training and this is working pretty well. 313 00:33:00,749 --> 00:33:06,113 So this is kind of like a super bare bones example of training a fully connected network in TensorFlow. 314 00:33:06,113 --> 00:33:08,046 But there's a problem here. 315 00:33:08,046 --> 00:33:15,086 So here, remember that on the forward pass, every time we execute this graph, we're actually feeding in the weights. 316 00:33:15,086 --> 00:33:18,835 We have the weights as Numpy arrays and we're explicitly feeding them into the graph. 317 00:33:18,835 --> 00:33:26,339 And now when the graph finishes executing it's going to give us these gradients. And remember the gradients are the same size as the weights. 318 00:33:26,339 --> 00:33:32,665 So this means that every time we're running the graph here, we're copying the weights from Numpy arrays into TensorFlow then getting the gradients 319 00:33:32,665 --> 00:33:36,419 and then copying the gradients from TensorFlow back out to Numpy arrays. 320 00:33:36,419 --> 00:33:39,849 So if you're just running on CPU, this is maybe not a huge deal, 321 00:33:39,849 --> 00:33:47,235 but remember we talked about the CPU GPU bottleneck and how it's very expensive actually to copy data between CPU memory and GPU memory. 322 00:33:47,235 --> 00:33:59,256 So if your network is very large and your weights and gradients are very big, then doing something like this would be super expensive and super slow because we'd be copying all kinds of data back and forth between the CPU and the GPU at every time step. 323 00:33:59,256 --> 00:34:01,689 So that's bad, we don't want to do that. We need to fix that. 324 00:34:01,689 --> 00:34:06,027 So, obviously TensorFlow has some solution to this. 325 00:34:06,027 --> 00:34:17,969 And the idea is that now we want our weights, w1 and w2, rather than being placeholders that we expect to feed in to the network on every forward pass, instead we define them as variables. 326 00:34:17,969 --> 00:34:27,346 So a variable is a value that lives inside the computational graph and it's going to persist inside the computational graph across different times when you run the same graph. 327 00:34:27,347 --> 00:34:33,094 So now instead of declaring these w1 and w2 as placeholders, instead we just construct them as variables. 328 00:34:33,094 --> 00:34:39,219 But now since they live inside the graph, we also need to tell TensorFlow how they should be initialized, right?
329 00:34:39,219 --> 00:34:44,606 Because in the previous case we were feeding in their values from outside the graph, so we initialized them in Numpy, 330 00:34:44,606 --> 00:34:50,569 but now because these things live inside the graph, TensorFlow is responsible for initializing them. 331 00:34:50,569 --> 00:34:53,149 So we need to pass in a tf.random_normal operation, 332 00:34:53,149 --> 00:35:00,627 which again is not actually initializing them when we run this line, this is just telling TensorFlow how we want them to be initialized. 333 00:35:00,627 --> 00:35:03,215 So there's a little bit of confusing misdirection going on here. 334 00:35:04,869 --> 00:35:11,862 And now, remember in the previous example we were actually updating the weights outside of the computational graph. 335 00:35:11,862 --> 00:35:17,219 In the previous example, we were computing the gradients and then using them to update the weights as Numpy arrays 336 00:35:17,219 --> 00:35:20,264 and then feeding in the updated weights at the next time step. 337 00:35:20,264 --> 00:35:29,402 But now because we want these weights to live inside the graph, this operation of updating the weights needs to also be an operation inside the computational graph. 338 00:35:29,402 --> 00:35:37,020 So now we use this assign function which mutates these variables inside the computational graph 339 00:35:37,020 --> 00:35:41,487 and now the mutated value will persist across multiple runs of the same graph. 340 00:35:41,487 --> 00:35:45,976 So now when we run this graph and when we train the network, 341 00:35:45,976 --> 00:35:53,825 now we need to run the graph once with a little bit of special incantation to tell TensorFlow to set up these variables that are going to live inside the graph. 342 00:35:53,825 --> 00:35:58,574 And then once we've done that initialization, now we can run the graph over and over again. 343 00:35:58,574 --> 00:36:05,091 And here, we're now only feeding in the data and labels X and Y and the weights are living inside the graph. 344 00:36:05,091 --> 00:36:09,517 And here we've asked TensorFlow to compute the loss for us. 345 00:36:09,517 --> 00:36:13,001 And then you might think that this would train the network, 346 00:36:13,001 --> 00:36:19,964 but there's actually a bug here. So, if you actually run this code, and you plot the loss, it doesn't train. 347 00:36:19,964 --> 00:36:23,401 So that's bad, it's confusing, like what's going on? 348 00:36:23,401 --> 00:36:29,957 We wrote this assign code, we ran the thing, like we computed the loss and the gradients and our loss is flat, what's going on? 349 00:36:29,957 --> 00:36:31,460 Any ideas? 350 00:36:31,460 --> 00:36:34,595 [student's words obscured due to lack of microphone] 351 00:36:34,595 --> 00:36:44,979 Yeah so one hypothesis is that maybe we're accidentally re-initializing the w's every time we call the graph. That's a good hypothesis, that's actually not the problem in this case. 352 00:36:44,979 --> 00:36:48,057 [student's words obscured due to lack of microphone] 353 00:36:48,057 --> 00:36:56,318 Yeah, so the answer is that we actually need to explicitly tell TensorFlow that we want to run these new w1 and new w2 operations. 354 00:36:56,318 --> 00:36:58,835 So we've built up this big computational graph data 355 00:36:58,835 --> 00:37:01,699 structure in memory and now when we call run, 356 00:37:01,699 --> 00:37:04,894 we only told TensorFlow that we wanted to compute loss.
357 00:37:04,894 --> 00:37:09,155 And if you look at the dependencies among these different operations inside the graph, 358 00:37:09,155 --> 00:37:13,715 you see that in order to compute loss we don't actually need to perform this update operation. 359 00:37:13,715 --> 00:37:21,496 So TensorFlow is smart and it only computes the parts of the graph that are necessary for computing the output that you asked it to compute. 360 00:37:21,496 --> 00:37:26,656 So that's kind of a nice thing because it means it's only doing as much work as it needs to, 361 00:37:26,656 --> 00:37:32,739 but in situations like this it can be a little bit confusing and lead to behavior that you didn't expect. 362 00:37:32,739 --> 00:37:39,141 So the solution in this case is that we actually need to explicitly tell TensorFlow to perform those update operations. 363 00:37:39,141 --> 00:37:49,531 So one thing we could do, which is what was suggested, is we could add new w1 and new w2 as outputs and just tell TensorFlow that we want to produce these values as outputs. 364 00:37:49,531 --> 00:37:57,366 But that's a problem too because the values, those new w1, new w2 values are again these big tensors. 365 00:37:58,891 --> 00:38:05,138 So now if we tell TensorFlow we want those as output, we're going to again get this copying behavior between CPU and GPU at every iteration. 366 00:38:05,138 --> 00:38:07,316 So that's bad, we don't want that. 367 00:38:07,316 --> 00:38:11,742 So there's a little trick you can do instead, which is that we add kind of a dummy node to the graph 368 00:38:11,742 --> 00:38:20,307 with these fake data dependencies, and we just say that this dummy node, called updates, has data dependencies on new w1 and new w2. 369 00:38:20,307 --> 00:38:25,803 And now when we actually run the graph, we tell it to compute both the loss and this dummy node. 370 00:38:25,803 --> 00:38:38,468 And this dummy node doesn't actually return any value, it just returns none, but because of this dependency that we've put into the node it ensures that when we run the updates value, we actually also run these update operations. 371 00:38:38,468 --> 00:38:39,551 So, question? 372 00:38:40,788 --> 00:38:44,955 [student's words obscured due to lack of microphone] 373 00:38:45,854 --> 00:38:51,370 Is there a reason why we didn't put X and Y into the graph, and they stayed as Numpy? 374 00:38:51,370 --> 00:38:57,151 So in this example we're reusing the same X and Y on every iteration. 375 00:38:57,151 --> 00:39:10,122 So you're right, we could have just also stuck those in the graph, but in a more realistic scenario, X and Y will be minibatches of data so those will actually change at every iteration and we will want to feed different values for those at every iteration. 376 00:39:10,122 --> 00:39:14,330 So in this case, they could have stayed in the graph, but in most cases they will change, 377 00:39:14,330 --> 00:39:17,913 so we don't want them to live in the graph. 378 00:39:19,388 --> 00:39:21,290 Oh, another question? 379 00:39:21,290 --> 00:39:25,457 [student's words obscured due to lack of microphone] 380 00:39:37,046 --> 00:39:44,305 Yeah, so we've told it, we had put into TensorFlow that the outputs we want are loss and updates. 381 00:39:44,305 --> 00:39:51,801 Updates is not actually a real value. So when updates evaluates it just returns none. 382 00:39:51,801 --> 00:39:57,416 But because of this dependency we've told it that updates depends on these assign operations.
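(Sketching the pattern under discussion, again with the TF 1.x API and the placeholder/loss definitions from the earlier sketch: the weights become tf.Variables, the updates become assign ops, and a tf.group node ties them together so fetching one thing triggers both updates.)

```python
# Sketch of the variable + assign + group pattern (TF 1.x API).
w1 = tf.Variable(tf.random_normal((D, H)))   # weights now live in the graph
w2 = tf.Variable(tf.random_normal((H, D)))

# ... forward pass, loss, and tf.gradients as in the earlier sketch ...

learning_rate = 1e-5
new_w1 = w1.assign(w1 - learning_rate * grad_w1)   # update ops inside the graph
new_w2 = w2.assign(w2 - learning_rate * grad_w2)
updates = tf.group(new_w1, new_w2)                 # dummy node depending on both

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())    # initialize the variables once
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        # Fetching `updates` forces the assign ops to run; it evaluates to None.
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```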
383 00:39:57,416 --> 00:40:02,358 But these assign operations live inside the computational graph and all live inside GPU memory. 384 00:40:02,358 --> 00:40:10,190 So then we're doing these update operations entirely on the GPU and we're no longer copying the updated values back out of the graph. 385 00:40:11,723 --> 00:40:15,112 [student's words obscured due to lack of microphone] 386 00:40:15,112 --> 00:40:18,195 So the question is does tf.group return none? 387 00:40:18,195 --> 00:40:25,923 So this gets into the trickiness of TensorFlow. So tf.group returns some crazy TensorFlow value. 388 00:40:25,923 --> 00:40:32,658 It sort of returns some like internal TensorFlow node operation that we can use to continue building the graph. 389 00:40:32,658 --> 00:40:43,333 But when you execute the graph, when inside the session.run we told it we wanted to compute the concrete value from updates, then that returns none. 390 00:40:43,333 --> 00:40:45,482 So whenever you're working with TensorFlow 391 00:40:45,482 --> 00:40:53,487 you have this funny indirection between building the graph and running it; while building the graph the output value is some funny weird object, and then you actually get 392 00:40:53,487 --> 00:40:55,466 a concrete value when you run the graph. 393 00:40:55,466 --> 00:40:59,967 So here after you run updates, then the output is none. Does that clear it up a little bit? 394 00:40:59,967 --> 00:41:04,134 [student's words obscured due to lack of microphone] 395 00:41:18,796 --> 00:41:22,334 So the question is why is loss a value and why is updates none? 396 00:41:22,334 --> 00:41:24,068 That's just the way that updates works. 397 00:41:24,068 --> 00:41:30,176 So loss is a value because when we tell TensorFlow we want to run a tensor, then we get the concrete value. 398 00:41:30,176 --> 00:41:35,753 Updates is this kind of special other data type that does not return a value, it instead returns none. 399 00:41:35,753 --> 00:41:38,703 So it's kind of some TensorFlow magic that's going on there. 400 00:41:38,703 --> 00:41:40,602 Maybe we can talk offline if you're still confused. 401 00:41:40,602 --> 00:41:42,678 [student's words obscured due to lack of microphone] 402 00:41:42,678 --> 00:41:46,186 Yeah, yeah, that behavior is coming from the group method. 403 00:41:46,186 --> 00:41:52,492 So now, we kind of have this weird pattern where we wanted to do these different assign operations, we have to use this funny tf.group thing. 404 00:41:52,492 --> 00:42:00,004 That's kind of a pain, so thankfully TensorFlow gives you some convenience operations that kind of do that kind of stuff for you. 405 00:42:00,004 --> 00:42:01,706 And that's called an optimizer. 406 00:42:01,706 --> 00:42:06,047 So here we're using a tf.train.GradientDescentOptimizer 407 00:42:06,047 --> 00:42:08,458 and we're telling it what learning rate we want to use. 408 00:42:08,458 --> 00:42:12,784 And you can imagine that there's, there's RMSprop, there's all kinds of different optimization algorithms here.
409 00:42:12,784 --> 00:42:16,284 And now we call optimizer.minimize of loss 410 00:42:17,311 --> 00:42:21,204 and now this is a pretty magical thing, 411 00:42:21,204 --> 00:42:30,586 because now this call is aware that these variables w1 and w2 are marked as trainable by default, so then internally, inside this optimizer.minimize 412 00:42:30,586 --> 00:42:35,184 it's going in and adding nodes to the graph which will compute the gradient of loss with respect 413 00:42:35,184 --> 00:42:42,219 to w1 and w2 and then it's also performing that update operation for you and it's doing the grouping operation for you and it's doing the assigns. 414 00:42:42,219 --> 00:42:44,206 It's doing a lot of magical stuff inside there. 415 00:42:44,206 --> 00:42:53,518 But then it ends up giving you this magical updates value which, if you dig through the code, is actually using tf.group, so it looks very similar internally to what we saw before. 416 00:42:53,518 --> 00:43:00,004 And now when we run the graph inside our loop we do the same pattern of telling it to compute loss and updates. 417 00:43:00,004 --> 00:43:07,450 And every time we tell the graph to compute updates, then it'll actually go and update the variables in the graph. 418 00:43:07,450 --> 00:43:08,593 Question? 419 00:43:08,593 --> 00:43:10,959 [student's words obscured due to lack of microphone] 420 00:43:10,959 --> 00:43:14,249 Yeah, so what is the tf.global_variables_initializer? 421 00:43:14,249 --> 00:43:20,502 So that's initializing w1 and w2 because these are variables which live inside the graph. 422 00:43:20,502 --> 00:43:37,733 So when we create the tf.Variable we have this tf.random_normal, which is the initialization, so the tf.global_variables_initializer is causing the tf.random_normal to actually run and generate concrete values to initialize those variables. 423 00:43:37,733 --> 00:43:40,794 [student's words obscured due to lack of microphone] 424 00:43:40,794 --> 00:43:42,271 Sorry, what was the question? 425 00:43:42,271 --> 00:43:45,233 [student's words obscured due to lack of microphone] 426 00:43:45,233 --> 00:43:51,385 So it knows that a placeholder is going to be fed outside of the graph and a variable is something that lives inside the graph. 427 00:43:51,385 --> 00:44:00,384 So I don't know all the details about what exactly it decides to run with that call. I think you'd need to dig through the code to figure that out, or maybe it's documented somewhere. 428 00:44:00,384 --> 00:44:06,130 So now we've again got this full example of training a network in TensorFlow and we're kind of adding 429 00:44:06,130 --> 00:44:09,328 bells and whistles to make it a little bit more convenient. 430 00:44:09,328 --> 00:44:16,954 So here, in the previous example we were computing the loss explicitly using our own tensor operations; in TensorFlow you can always do that, 431 00:44:16,954 --> 00:44:20,739 you can use basic tensor operations to compute just about anything you want. 432 00:44:20,739 --> 00:44:26,734 But TensorFlow also gives you a bunch of convenience functions that compute these common neural network things for you. 433 00:44:26,734 --> 00:44:30,040 So in this case we can use tf.losses.mean_squared_error 434 00:44:30,040 --> 00:44:36,273 and it just does the L2 loss for us so we don't have to compute it ourselves in terms of basic tensor operations.
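A sketch of the same training loop once the optimizer and the loss helper take over that bookkeeping might look roughly like this (again TensorFlow 1.x style, with sizes and learning rate chosen arbitrarily):

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))
w1 = tf.Variable(tf.random_normal((D, H)))
w2 = tf.Variable(tf.random_normal((H, D)))

h = tf.maximum(tf.matmul(x, w1), 0)
y_pred = tf.matmul(h, w2)

# Convenience loss instead of writing the L2 loss by hand.
loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)

# The optimizer adds the gradient, assign, and group nodes to the graph for us.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-3)
updates = optimizer.minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # runs the random_normal initializers
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```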
435 00:44:36,273 --> 00:44:46,667 So another kind of weirdness here is that it was kind of annoying that we had to explicitly define our inputs and define our weights and then chain them together in the forward pass using a matrix multiply. 436 00:44:46,667 --> 00:44:54,291 And in this example we've actually not put biases in the layer because that would be kind of extra work: we'd have to initialize the biases, 437 00:44:54,291 --> 00:44:58,494 we'd have to get them in the right shape, we'd have to broadcast the biases against the output 438 00:44:58,494 --> 00:45:01,966 of the matrix multiply and you can see that that would kind of be a lot of code. 439 00:45:01,966 --> 00:45:03,664 It would be kind of annoying to write. 440 00:45:03,664 --> 00:45:09,653 And once you get to convolutions and batch normalization and other types of layers, this kind of basic way of working, 441 00:45:09,653 --> 00:45:17,403 of having these variables, having these inputs and outputs and combining them all together with basic computational graph operations, could be a little bit 442 00:45:17,403 --> 00:45:22,954 unwieldy and it could be really annoying to make sure you initialize the weights with the right shapes and all that sort of stuff. 443 00:45:22,954 --> 00:45:30,615 So as a result, there's a bunch of sort of higher level libraries that wrap around TensorFlow and handle some of these details for you. 444 00:45:30,615 --> 00:45:35,965 So one example that ships with TensorFlow is tf.layers. 445 00:45:35,965 --> 00:45:44,060 So now in this code example you can see that our code is only explicitly declaring the X and the Y, which are the placeholders for the data and the labels. 446 00:45:44,060 --> 00:45:53,036 And now we say that H=tf.layers.dense, we give it the input X and we tell it units=H. 447 00:45:53,036 --> 00:45:55,171 This is again kind of a magical line 448 00:45:55,171 --> 00:46:07,411 because inside this line, it's kind of setting up w1 and b1, the bias, it's setting up variables for those with the right shapes that are kind of inside the graph but a little bit hidden from us. 449 00:46:07,411 --> 00:46:12,931 And it's using this Xavier initializer object to set up an initialization strategy for those. 450 00:46:12,931 --> 00:46:17,200 So before we were doing that explicitly ourselves with the tf.random_normal business, 451 00:46:17,200 --> 00:46:22,266 but now here it's kind of handling some of those details for us and it's just spitting out an H, 452 00:46:22,266 --> 00:46:27,515 which is again the same sort of H that we saw in the previous example, it's just doing some of those details for us. 453 00:46:28,487 --> 00:46:36,910 And you can see here, we're also passing an activation=tf.nn.relu so it's even doing the activation, the relu activation function, inside this layer for us. 454 00:46:36,910 --> 00:46:41,370 So it's taking care of a lot of these architectural details for us. 455 00:46:41,370 --> 00:46:42,784 Question? 456 00:46:42,784 --> 00:46:46,446 [student's words obscured due to lack of microphone] 457 00:46:46,446 --> 00:46:51,168 Question is does the Xavier initializer default to a particular distribution? 458 00:46:51,168 --> 00:46:55,850 I'm sure it has some default, I'm not sure what it is. I think you'll have to look at the documentation. 459 00:46:55,850 --> 00:46:58,010 But it seems to be a reasonable strategy, I guess.
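As a sketch, the tf.layers version being described might look roughly like this (the initializer choice and the sizes here are assumptions on my part, not necessarily what the slide used):

```python
import numpy as np
import tensorflow as tf

N, D, H = 64, 1000, 100
x = tf.placeholder(tf.float32, shape=(N, D))
y = tf.placeholder(tf.float32, shape=(N, D))

# tf.layers.dense sets up the weight and bias variables with the right shapes
# inside the graph for us, and even applies the relu activation here.
init = tf.contrib.layers.xavier_initializer()
h = tf.layers.dense(inputs=x, units=H, activation=tf.nn.relu,
                    kernel_initializer=init)
y_pred = tf.layers.dense(inputs=h, units=D, kernel_initializer=init)

loss = tf.losses.mean_squared_error(labels=y, predictions=y_pred)
updates = tf.train.GradientDescentOptimizer(1e-3).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    values = {x: np.random.randn(N, D), y: np.random.randn(N, D)}
    for t in range(50):
        loss_val, _ = sess.run([loss, updates], feed_dict=values)
```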
460 00:46:58,010 --> 00:47:04,111 And in fact if you run this code, it converges much faster than the previous one because the initialization is better. 461 00:47:04,111 --> 00:47:11,911 And you can see that we're using two calls to tf.layers and this lets us build our model without doing all these explicit bookkeeping details ourselves. 462 00:47:11,911 --> 00:47:14,273 So this is maybe a little bit more convenient. 463 00:47:14,273 --> 00:47:18,682 But tf.contrib.layers is really not the only game in town. 464 00:47:18,682 --> 00:47:23,349 There's like a lot of different higher level libraries that people build on top of TensorFlow. 465 00:47:23,349 --> 00:47:26,841 And it's kind of due to this basic impedance mismatch 466 00:47:26,841 --> 00:47:30,315 where the computational graph is a relatively low level thing, 467 00:47:30,315 --> 00:47:36,426 but when we're working with neural networks we have this concept of layers and weights, and some layers have weights associated with them, 468 00:47:36,426 --> 00:47:41,866 and we typically think at a slightly higher level of abstraction than this raw computational graph. 469 00:47:41,866 --> 00:47:48,503 So that's where these various packages are trying to help you out, by letting you work at this higher level of abstraction. 470 00:47:48,503 --> 00:47:52,460 So another very popular package that you may have seen before is Keras. 471 00:47:52,460 --> 00:48:02,806 Keras is a very beautiful, nice API that sits on top of TensorFlow and handles sort of building up the computational graph for you in the back end. 472 00:48:02,806 --> 00:48:07,704 By the way, Keras also supports Theano as a back end, so that's also kind of nice. 473 00:48:07,704 --> 00:48:10,958 And in this example you can see we build the model as a sequence of layers. 474 00:48:10,958 --> 00:48:17,910 We build some optimizer object and we call model.compile and this does a lot of magic in the back end to build the graph. 475 00:48:17,910 --> 00:48:22,797 And now we can call model.fit and that does the whole training procedure for us magically. 476 00:48:22,797 --> 00:48:28,523 So I don't know all the details of how this works, but I know Keras is very popular, so you might consider using it if you're working with TensorFlow. 477 00:48:29,797 --> 00:48:31,270 Question? 478 00:48:31,270 --> 00:48:35,437 [student's words obscured due to lack of microphone] 479 00:48:41,717 --> 00:48:45,525 Yeah, so the question is why there's no explicit CPU, GPU stuff going on here. 480 00:48:45,525 --> 00:48:48,409 So I've kind of left that out to keep the code clean. 481 00:48:48,409 --> 00:48:54,607 But you saw in the examples at the beginning that it was pretty easy to flip all these things between CPU and GPU, and there was either some global flag 482 00:48:54,607 --> 00:49:01,635 or some different data type or some with statement, and it's usually relatively simple and just about one line to swap in each case. 483 00:49:01,635 --> 00:49:06,149 But exactly what that line looks like differs a bit depending on the situation. 484 00:49:06,149 --> 00:49:14,186 So there's actually this whole large set of higher level TensorFlow wrappers that you might see out there in the wild. 485 00:49:14,186 --> 00:49:21,276 And it seems that even people within Google can't really agree on which one is the right one to use. 486 00:49:22,230 --> 00:49:26,829 So Keras and TFLearn are third party libraries that are out there on the internet by other people.
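A rough sketch of the Keras pattern described a moment ago might look like this (the layer sizes, optimizer settings, and exact argument names are assumptions, and the API has shifted a bit across Keras versions):

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import SGD

N, D, H = 64, 1000, 100

# Define the model as a sequence of layers.
model = Sequential()
model.add(Dense(units=H, input_dim=D))
model.add(Activation('relu'))
model.add(Dense(units=D))

# compile() builds the underlying computational graph in the back end.
model.compile(loss='mean_squared_error', optimizer=SGD(lr=1e-3))

# fit() runs the whole training loop for us.
x = np.random.randn(N, D)
y = np.random.randn(N, D)
model.fit(x, y, epochs=10, batch_size=N, verbose=0)
```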
487 00:49:26,829 --> 00:49:32,563 But there's these three different ones, tf.layers, TF-Slim and tf.contrib.learn 488 00:49:32,563 --> 00:49:39,727 that all ship with TensorFlow, that are all kind of doing a slightly different version of this higher level wrapper thing. 489 00:49:39,727 --> 00:49:46,291 There's another framework also from Google, but not shipping with TensorFlow, called Pretty Tensor that does the same sort of thing. 490 00:49:46,291 --> 00:49:48,599 And I guess none of these were good enough for DeepMind, 491 00:49:48,599 --> 00:49:54,530 because they went ahead a couple weeks ago and wrote and released their very own high level TensorFlow wrapper called Sonnet. 492 00:49:54,530 --> 00:50:00,715 So I wouldn't begrudge you if you were kind of confused by all these things. There's a lot of different choices. 493 00:50:00,715 --> 00:50:07,423 They don't always play nicely with each other. But you have a lot of options, so that's good. 494 00:50:07,423 --> 00:50:09,123 TensorFlow has pretrained models. 495 00:50:09,123 --> 00:50:11,112 There's some examples in TF-Slim, and in Keras. 496 00:50:11,112 --> 00:50:15,874 'Cause remember pretrained models are super important when you're training your own things. 497 00:50:15,874 --> 00:50:21,072 There's also this idea of Tensorboard; I don't want to get into details, 498 00:50:21,072 --> 00:50:27,747 but with Tensorboard you can add sort of instrumentation to your code and then plot losses and things as you go through the training process. 499 00:50:27,747 --> 00:50:32,760 TensorFlow also lets you run distributed, where you can break up a computational graph and run it across different machines. 500 00:50:32,760 --> 00:50:37,613 That's super cool, but I think probably not anyone outside of Google is really using that to great success 501 00:50:37,613 --> 00:50:44,193 these days, but if you do want to run distributed stuff, probably TensorFlow is the main game in town for that. 502 00:50:44,193 --> 00:50:51,533 A side note is that a lot of the design of TensorFlow is kind of spiritually inspired by this earlier framework called Theano from Montreal. 503 00:50:51,533 --> 00:50:55,933 I don't want to go through the details here, just if you go through these slides on your own, 504 00:50:55,933 --> 00:50:59,979 you can see that the code for Theano ends up looking very similar to TensorFlow. 505 00:50:59,979 --> 00:51:03,512 Where we define some variables, we do some forward pass, we compute some gradients, 506 00:51:03,512 --> 00:51:08,034 and we compile some function, then we run the function over and over to train the network. 507 00:51:08,034 --> 00:51:10,290 So it kind of looks a lot like TensorFlow. 508 00:51:10,290 --> 00:51:16,671 So we still have a lot to get through, so I'm going to move on to PyTorch and maybe take questions at the end. 509 00:51:16,671 --> 00:51:26,397 So, PyTorch from Facebook is kind of different from TensorFlow in that we have sort of three explicit different layers of abstraction inside PyTorch. 510 00:51:26,397 --> 00:51:30,619 So PyTorch has this tensor object which is just like a Numpy array. 511 00:51:30,619 --> 00:51:36,770 It's just an imperative array, it doesn't know anything about deep learning, but it can run on the GPU. 512 00:51:36,770 --> 00:51:44,093 We have this variable object which is a node in a computational graph; this builds up computational graphs, lets you compute gradients, that sort of thing.
513 00:51:44,093 --> 00:51:50,766 And we have a module object which is a neural network layer; you can compose these modules together to build big networks. 514 00:51:50,766 --> 00:52:01,457 So if you kind of want to think about rough equivalents between PyTorch and TensorFlow, you can think of the PyTorch tensor as fulfilling the same role as the Numpy array in TensorFlow. 515 00:52:01,457 --> 00:52:08,803 The PyTorch variable is similar to the TensorFlow tensor or variable or placeholder, which are all sort of nodes in a computational graph. 516 00:52:08,803 --> 00:52:18,448 And now the PyTorch module is kind of equivalent to these higher level things from tf.slim or tf.layers or Sonnet or these other higher level frameworks. 517 00:52:18,448 --> 00:52:24,072 So right away one thing to notice about PyTorch is that because it ships with this high level abstraction, 518 00:52:24,072 --> 00:52:29,780 like one really nice higher level abstraction called modules, on its own, there's sort of less choice involved. 519 00:52:29,780 --> 00:52:36,642 Just stick with nn modules and you'll be good to go. You don't need to worry about which higher level wrapper to use. 520 00:52:37,777 --> 00:52:41,944 So PyTorch tensors, as I said, are just like Numpy arrays, 521 00:52:43,660 --> 00:52:47,787 so here on the right we've done an entire two layer network using entirely PyTorch tensors. 522 00:52:47,787 --> 00:52:53,910 One thing to note is that we're not importing Numpy here at all anymore. We're just doing all these operations using PyTorch tensors. 523 00:52:53,910 --> 00:53:01,245 And this code looks exactly like the two layer net code that you wrote in Numpy on the first homework. 524 00:53:01,245 --> 00:53:07,127 So you set up some random data, you use some operations to compute the forward pass. 525 00:53:07,127 --> 00:53:10,165 And then we're explicitly computing the backward pass ourselves. 526 00:53:10,165 --> 00:53:15,980 Just sort of backpropping through the network, through the operations, just as you did on homework one. 527 00:53:15,980 --> 00:53:22,672 And now we're doing a manual update of the weights using a learning rate and using our computed gradients. 528 00:53:22,672 --> 00:53:27,785 But the major difference between PyTorch tensors and Numpy arrays is that they can run on the GPU, 529 00:53:27,785 --> 00:53:33,034 so all you have to do to make this code run on GPU is use a different data type. 530 00:53:33,034 --> 00:53:42,816 Rather than using torch.FloatTensor, you do torch.cuda.FloatTensor, cast all of your tensors to this new datatype and everything runs magically on the GPU. 531 00:53:43,709 --> 00:53:47,637 You should think of PyTorch tensors as just Numpy plus GPU. 532 00:53:47,637 --> 00:53:50,818 That's exactly what it is, nothing specific to deep learning. 533 00:53:52,638 --> 00:53:55,278 So the next layer of abstraction in PyTorch is the variable. 534 00:53:55,278 --> 00:54:03,460 So once we move from tensors to variables, now we're building computational graphs and we're able to take gradients automatically and everything like that. 535 00:54:03,460 --> 00:54:12,744 So here, if X is a variable, then x.data is a tensor and x.grad is another variable containing the gradients of the loss with respect to that variable. 536 00:54:14,007 --> 00:54:17,246 So x.grad.data is an actual tensor containing those gradients. 537 00:54:18,972 --> 00:54:22,387 And PyTorch tensors and variables have the exact same API.
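A sketch of the variable-and-autograd version of the two layer net might look like this (written against the older PyTorch API where Variable was a separate wrapper type; the sizes and learning rate are made up):

```python
import torch
from torch.autograd import Variable

N, D, H = 64, 1000, 100
x = Variable(torch.randn(N, D), requires_grad=False)
y = Variable(torch.randn(N, D), requires_grad=False)
w1 = Variable(torch.randn(D, H), requires_grad=True)
w2 = Variable(torch.randn(H, D), requires_grad=True)

learning_rate = 1e-6
for t in range(500):
    # Forward pass looks just like the tensor version: same API.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)
    loss = (y_pred - y).pow(2).sum()

    # Backward pass: gradients show up in w1.grad and w2.grad.
    if w1.grad is not None:
        w1.grad.data.zero_()
        w2.grad.data.zero_()
    loss.backward()

    # Manual gradient descent step on the underlying tensors.
    w1.data -= learning_rate * w1.grad.data
    w2.data -= learning_rate * w2.grad.data
```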
538 00:54:22,387 --> 00:54:28,457 So any code that worked on PyTorch tensors you can just make them variables instead and run the same code, 539 00:54:28,457 --> 00:54:34,459 except now you're building up a computational graph rather than just doing these imperative operations. 540 00:54:35,943 --> 00:54:47,461 So here when we create these variables, each call to the variable constructor wraps a PyTorch tensor and then also gives a flag for whether or not we want to compute gradients with respect to that variable. 541 00:54:47,461 --> 00:54:54,073 And now the forward pass looks exactly like it did before in the case with tensors, because variables and tensors have the same API. 542 00:54:54,073 --> 00:54:59,683 So now we're computing our predictions, we're computing our loss in kind of this imperative way. 543 00:54:59,683 --> 00:55:05,251 And then we call loss.backward and now all these gradients come out for us. 544 00:55:05,251 --> 00:55:11,528 And then we can make a gradient update step on our weights using the gradients that are now present in w1.grad.data. 545 00:55:11,528 --> 00:55:18,137 So this ends up looking quite like the Numpy case, except all the gradients come for free. 546 00:55:18,137 --> 00:55:23,353 One thing to note that's kind of different between PyTorch and TensorFlow is that in the TensorFlow case 547 00:55:23,353 --> 00:55:27,132 we were building up this explicit graph, then running the graph many times. 548 00:55:27,132 --> 00:55:32,152 Here in PyTorch, instead we're building up a new graph every time we do a forward pass. 549 00:55:32,152 --> 00:55:37,058 And this makes the code look a bit cleaner. And it has some other implications that we'll get to in a bit. 550 00:55:37,058 --> 00:55:40,630 So in PyTorch you can define your own new autograd functions 551 00:55:40,630 --> 00:55:42,933 by defining the forward and backward in terms of tensors. 552 00:55:42,933 --> 00:55:48,303 This ends up looking kind of like the modular layers code that you write for homework two, 553 00:55:48,303 --> 00:55:54,433 where you implement forward and backward using tensor operations and then stick these things inside a computational graph. 554 00:55:54,433 --> 00:56:00,654 So here we're defining our own relu and then we can actually go in and use our own relu 555 00:56:00,654 --> 00:56:05,214 operation and stick it inside our computational graph, and define our own operations this way. 556 00:56:05,214 --> 00:56:09,097 But most of the time you will probably not need to define your own autograd operations. 557 00:56:09,097 --> 00:56:14,246 Most of the time, the operations you need will already be implemented for you. 558 00:56:14,246 --> 00:56:23,349 So in TensorFlow we saw that we can move to something like Keras or TFLearn, and this gives us a higher level API to work with, rather than these raw computational graphs. 559 00:56:23,349 --> 00:56:30,948 The equivalent in PyTorch is the nn package, which provides these high level wrappers for working with these things. 560 00:56:31,882 --> 00:56:37,772 But unlike TensorFlow there's only one of them. And it works pretty well, so just use that if you're using PyTorch. 561 00:56:37,772 --> 00:56:44,436 So here, this ends up kind of looking like Keras, where we define our model as some sequence of layers, our linear and relu operations. 562 00:56:44,436 --> 00:56:49,816 And we use some loss function defined in the nn package, that's our mean squared error loss.
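A minimal sketch of that nn.Sequential pattern, including the explicit parameter update loop described next, might look like this (the sizes, loss settings, and learning rate are assumptions, and this again follows the older PyTorch API):

```python
import torch
from torch.autograd import Variable

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

# Model defined as a sequence of layers, plus a loss function from nn.
model = torch.nn.Sequential(
    torch.nn.Linear(D_in, H),
    torch.nn.ReLU(),
    torch.nn.Linear(H, D_out),
)
loss_fn = torch.nn.MSELoss(size_average=False)

learning_rate = 1e-4
for t in range(500):
    y_pred = model(x)          # forward pass through the whole stack
    loss = loss_fn(y_pred, y)  # scalar loss

    model.zero_grad()          # clear old gradients
    loss.backward()            # gradients for every parameter, for free

    # Explicit gradient descent step over all parameters of the model.
    for param in model.parameters():
        param.data -= learning_rate * param.grad.data
```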
563 00:56:49,816 --> 00:56:55,214 And now inside each iteration of our loop we can run data forward through the model to get our predictions. 564 00:56:55,214 --> 00:56:59,054 We can run the predictions forward through the loss function to get our scalar loss, 565 00:56:59,054 --> 00:57:04,021 then we can call loss.backward, get all our gradients for free, and then loop over the parameters of the model 566 00:57:04,021 --> 00:57:07,273 and do our explicit gradient descent step to update the model. 567 00:57:07,273 --> 00:57:12,749 And again we see that we're sort of building up this new computational graph every time we do a forward pass. 568 00:57:12,749 --> 00:57:17,017 And just like we saw in TensorFlow, PyTorch provides these optimizer objects 569 00:57:17,017 --> 00:57:23,000 that kind of abstract away this updating logic and implement fancier update rules like Adam and whatnot. 570 00:57:23,000 --> 00:57:28,771 So here we're constructing an optimizer object, telling it that we want it to optimize over the parameters of the model. 571 00:57:28,771 --> 00:57:31,115 Giving it some learning rate and the other hyperparameters. 572 00:57:31,115 --> 00:57:39,810 And now after we compute our gradients we can just call optimizer.step and it updates all the parameters of the model for us right here. 573 00:57:39,810 --> 00:57:44,714 So another common thing you'll do in PyTorch a lot is define your own nn modules. 574 00:57:44,714 --> 00:57:51,801 So typically you'll write your own class which defines your entire model as a single new nn module class. 575 00:57:51,801 --> 00:58:01,043 And a module is just kind of a neural network layer that can contain other modules, or trainable weights, or other kinds of state. 576 00:58:01,043 --> 00:58:07,051 So in this case we can redo the two layer net example by defining our own nn module class. 577 00:58:07,051 --> 00:58:11,672 So now here in the initializer of the class we're assigning this linear1 and linear2. 578 00:58:11,672 --> 00:58:17,257 We're constructing these new module objects and then storing them inside our own class. 579 00:58:17,257 --> 00:58:26,466 And now in the forward pass we can use both our own internal modules as well as arbitrary autograd operations on variables to compute the output of our network. 580 00:58:26,466 --> 00:58:31,594 So here, inside this forward method, we receive the input x as a variable, 581 00:58:31,594 --> 00:58:35,817 then we pass the variable to our self.linear1 for the first layer. 582 00:58:35,817 --> 00:58:38,129 We use the autograd op clamp to compute the relu, 583 00:58:38,129 --> 00:58:42,233 we pass the output of that to the second linear and then that gives us our output. 584 00:58:42,233 --> 00:58:46,633 And now the rest of this code for training this thing looks pretty much the same. 585 00:58:46,633 --> 00:58:54,676 Where we build an optimizer and loop over, and on every iteration feed data to the model, compute the gradients with loss.backward, call optimizer.step. 586 00:58:54,676 --> 00:59:01,817 So this is relatively characteristic of what you might see in a lot of PyTorch type training scenarios. 587 00:59:01,817 --> 00:59:11,166 Where you define your own class, defining your own model that contains other modules and whatnot, and then you have some explicit training loop like this that runs it and updates it. 588 00:59:11,166 --> 00:59:18,873 One kind of nice quality of life thing that you have in PyTorch is a dataloader.
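Before getting to the dataloader, here is a minimal sketch of the custom nn.Module pattern just described (again the older PyTorch API; the sizes, loss, and learning rate are assumptions for illustration):

```python
import torch
from torch.autograd import Variable

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        super(TwoLayerNet, self).__init__()
        # Child modules: their weights are registered as parameters of this module.
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        # Mix child modules with arbitrary autograd ops (clamp acts as the relu).
        h = self.linear1(x).clamp(min=0)
        return self.linear2(h)

N, D_in, H, D_out = 64, 1000, 100, 10
x = Variable(torch.randn(N, D_in))
y = Variable(torch.randn(N, D_out), requires_grad=False)

model = TwoLayerNet(D_in, H, D_out)
criterion = torch.nn.MSELoss(size_average=False)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)

for t in range(500):
    y_pred = model(x)          # forward pass builds a fresh graph
    loss = criterion(y_pred, y)
    optimizer.zero_grad()      # clear old gradients
    loss.backward()            # compute new gradients
    optimizer.step()           # update all model parameters
```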
So a dataloader can handle building minibatches for you. 589 00:59:18,873 --> 00:59:27,273 It can handle some of the multi-threading that we talked about for you, where it can actually use multiple threads in the background to build minibatches for you and stream data off disk. 590 00:59:27,273 --> 00:59:33,221 So here a dataloader wraps a dataset and provides some of these abstractions for you. 591 00:59:33,221 --> 00:59:40,208 And in practice, when you want to run on your own data, you typically will write your own dataset class which knows how to read your particular type of data 592 00:59:40,208 --> 00:59:44,458 off whatever source you want, and then wrap it in a dataloader and train with that. 593 00:59:44,458 --> 00:59:52,233 So here we can see that now we're iterating over the dataloader object and at every iteration this is yielding minibatches of data. 594 00:59:52,233 --> 00:59:58,409 And it's internally handling the shuffling of the data and multithreaded data loading and all this sort of stuff for you. 595 00:59:58,409 --> 01:00:04,161 So this is kind of a complete PyTorch example and a lot of PyTorch training code ends up looking something like this. 596 01:00:05,583 --> 01:00:07,587 PyTorch provides pretrained models. 597 01:00:07,587 --> 01:00:11,521 And this is probably the slickest pretrained model experience I've ever seen. 598 01:00:11,521 --> 01:00:14,268 You just say torchvision.models.alexnet(pretrained=True). 599 01:00:14,268 --> 01:00:18,759 That'll go off in the background and download the pretrained weights for you if you don't already have them, 600 01:00:18,759 --> 01:00:24,242 and then it's right there, you're good to go. So this is super easy to use. 601 01:00:24,242 --> 01:00:27,094 For PyTorch there's also a package called Visdom 602 01:00:27,094 --> 01:00:33,600 that lets you visualize some of these loss statistics, somewhat similar to Tensorboard. 603 01:00:33,600 --> 01:00:38,569 So that's kind of nice. I haven't actually gotten a chance to play around with this myself so I can't really speak to how useful it is, 604 01:00:38,569 --> 01:00:45,907 but one of the major differences between Tensorboard and Visdom is that Tensorboard actually lets you visualize the structure of the computational graph. 605 01:00:45,907 --> 01:00:50,989 Which is really cool, a really useful debugging strategy. And Visdom does not have that functionality yet. 606 01:00:50,989 --> 01:00:54,761 But I've never really used this myself so I can't really speak to its utility. 607 01:00:56,350 --> 01:01:05,491 As a bit of an aside, PyTorch is kind of an evolution of, kind of a newer updated version of, an older framework called Torch which I worked with a lot in the last couple of years. 608 01:01:05,491 --> 01:01:13,280 And I don't want to go through the details here, but PyTorch is pretty much better in a lot of ways than the old Lua Torch, although they actually share a lot 609 01:01:13,280 --> 01:01:18,100 of the same back-end C code for computing with tensors and GPU operations on tensors and whatnot. 610 01:01:18,100 --> 01:01:23,369 So if you look through this Torch example, some of it ends up looking kind of similar to PyTorch, some of it's a bit different. 611 01:01:23,369 --> 01:01:25,957 Maybe you can step through this offline. 612 01:01:25,957 --> 01:01:33,011 But kind of the high level differences between Torch and PyTorch are that Torch is actually in Lua, not Python, unlike these other things. 613 01:01:33,011 --> 01:01:37,748 So learning Lua is a bit of a turn off for some people.
614 01:01:37,748 --> 01:01:40,009 Torch doesn't have autograd. 615 01:01:40,009 --> 01:01:44,324 Torch is also older, so it's more stable, less susceptible to bugs, and there's maybe more example code for Torch. 616 01:01:45,230 --> 01:01:47,214 They're about the same speed, so that's not really a concern. 617 01:01:47,214 --> 01:01:54,531 But PyTorch is in Python, which is great, and you've got autograd, which makes it a lot simpler to write complex models. 618 01:01:54,531 --> 01:01:59,670 In Lua Torch you end up writing a lot of your own backprop code sometimes, so that's a little bit annoying. 619 01:01:59,670 --> 01:02:06,051 But PyTorch is newer, there's less existing code, it's still subject to change. So it's a little bit more of an adventure. 620 01:02:06,051 --> 01:02:17,765 But at least for me, I don't really see much reason for myself to use Torch over PyTorch anymore at this time. So I'm pretty much using PyTorch exclusively for all my work these days. 621 01:02:18,606 --> 01:02:22,531 We've talked a little bit about this idea of static versus dynamic graphs. 622 01:02:22,531 --> 01:02:26,291 And this is one of the main distinguishing features between PyTorch and TensorFlow. 623 01:02:26,291 --> 01:02:38,145 So we saw that in TensorFlow you have these two stages of operation where first you build up this computational graph, then you run the computational graph over and over again many many times, reusing that same graph. 624 01:02:38,145 --> 01:02:42,403 That's called a static computational graph 'cause there's only one of them. 625 01:02:42,403 --> 01:02:48,771 And we saw PyTorch is quite different, where we're actually building up this new computational graph, this new fresh thing, on every forward pass. 626 01:02:48,771 --> 01:02:52,259 That's called a dynamic computational graph. 627 01:02:52,259 --> 01:02:57,053 For kind of simple cases, with kind of feed forward neural networks, it doesn't really make a huge difference, 628 01:02:57,053 --> 01:03:00,225 the code ends up looking kind of similar and they work kind of similarly, 629 01:03:00,225 --> 01:03:07,102 but I do want to talk a bit about some of the implications of static versus dynamic. And what are the tradeoffs of those two. 630 01:03:07,102 --> 01:03:15,286 So one kind of nice idea with static graphs is that because we're kind of building up one computational graph once, and then reusing it many times, 631 01:03:15,286 --> 01:03:19,571 the framework might have the opportunity to go in and do optimizations on that graph. 632 01:03:19,571 --> 01:03:26,809 And kind of fuse some operations, reorder some operations, figure out the most efficient way to execute that graph, so it can be really efficient. 633 01:03:26,809 --> 01:03:33,039 And because we're going to reuse that graph many times, maybe that optimization process is expensive up front, 634 01:03:33,039 --> 01:03:37,230 but we can amortize that cost with the speedups that we get when we run the graph many many times. 635 01:03:37,230 --> 01:03:44,085 So as kind of a concrete example, maybe if you write some graph which has convolution and relu operations kind of one after another, 636 01:03:44,085 --> 01:03:54,530 you might imagine that some fancy graph optimizer could go in and actually emit custom code with fused operations, fusing the convolution 637 01:03:54,530 --> 01:04:03,445 and the relu, so now it's computing the same thing as the code you wrote, but it might be able to be executed more efficiently.
638 01:04:03,445 --> 01:04:10,419 So I'm not too sure exactly what the state of TensorFlow graph optimization is in practice right now, 639 01:04:10,419 --> 01:04:20,131 but at least in principle, this is one place where static graphs really have the potential for this kind of optimization, 640 01:04:20,131 --> 01:04:24,298 where maybe it would be not so tractable for dynamic graphs. 641 01:04:25,504 --> 01:04:28,931 Another kind of subtle point about static versus dynamic is this idea of serialization. 642 01:04:28,931 --> 01:04:34,026 So with a static graph you can imagine that you write this code that builds up the graph, 643 01:04:34,026 --> 01:04:39,571 and then once you've built the graph, you have this data structure in memory that represents the entire structure of your network. 644 01:04:39,571 --> 01:04:42,428 And now you could take that data structure and just serialize it to disk. 645 01:04:42,428 --> 01:04:45,996 And now you've got the whole structure of your network saved in some file. 646 01:04:45,996 --> 01:04:55,450 And then you could later re-load that thing and then run that computational graph without access to the original code that built it. So this would be kind of nice in a deployment scenario. 647 01:04:55,450 --> 01:05:00,424 You might imagine that you might want to train your network in Python because it's maybe easier to work with, 648 01:05:00,424 --> 01:05:07,759 but then you could serialize that network and deploy it in maybe a C++ environment, where you don't need the original code that built the graph. 649 01:05:07,759 --> 01:05:10,909 So that's kind of a nice advantage of static graphs. 650 01:05:10,909 --> 01:05:15,793 Whereas with a dynamic graph, because we're interleaving these processes of graph building and graph execution, 651 01:05:15,793 --> 01:05:22,012 you kind of need the original code at all times if you want to reuse that model in the future. 652 01:05:22,012 --> 01:05:29,163 On the other hand, some advantages for dynamic graphs are that they just make your code a lot cleaner and a lot easier to write in a lot of scenarios. 653 01:05:29,163 --> 01:05:38,624 So for example, suppose that we want to do some conditional operation where, depending on the value of some variable Z, we want to do different operations to compute Y. 654 01:05:39,723 --> 01:05:45,070 Where if Z is positive, we want to use one weight matrix, and if Z is negative we want to use a different weight matrix. 655 01:05:45,070 --> 01:05:47,981 And we just want to switch off between these two alternatives. 656 01:05:47,981 --> 01:05:52,011 In PyTorch, because we're using dynamic graphs, it's super simple. 657 01:05:52,011 --> 01:06:00,795 Your code kind of looks exactly like you would expect, exactly what you would do in Numpy. You can just use normal Python control flow to handle this thing. 658 01:06:00,795 --> 01:06:05,563 And now because we're building up the graph each time, each time we perform this operation we'll take one 659 01:06:05,563 --> 01:06:10,864 of the two paths and build up maybe a different graph on each forward pass, but for any graph that we do 660 01:06:10,864 --> 01:06:14,337 end up building, we can backpropagate through it just fine. 661 01:06:14,337 --> 01:06:15,941 And the code is very clean, easy to work with.
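A sketch of that conditional case in PyTorch might look like this (the names, sizes, and the exact condition are made up for illustration):

```python
import torch
from torch.autograd import Variable

N, D, H = 3, 4, 5
x = Variable(torch.randn(N, D))
w1 = Variable(torch.randn(D, H), requires_grad=True)
w2 = Variable(torch.randn(D, H), requires_grad=True)
z = torch.randn(1)

# Ordinary Python if statement; a different graph gets built depending on z.
if z[0] > 0:
    y = x.mm(w1)
else:
    y = x.mm(w2)

# Whichever path was taken, we can backprop through the graph we actually built.
y.sum().backward()
```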
662 01:06:15,941 --> 01:06:23,201 Now in TensorFlow the situation is a little bit more complicated, because we build the graph once, so 663 01:06:23,201 --> 01:06:28,400 this control flow operator kind of needs to be an explicit operator in the TensorFlow graph. 664 01:06:28,400 --> 01:06:36,818 And now, so then you can see that we have this tf.cond call, which is kind of like a TensorFlow version of an if statement, but now it's baked into 665 01:06:36,818 --> 01:06:40,741 the computational graph rather than using sort of Python control flow. 666 01:06:40,741 --> 01:06:48,729 And the problem is that because we only build the graph once, all the potential paths of control flow that our program might flow through need to be baked 667 01:06:48,729 --> 01:06:52,523 into the graph at the time we construct it, before we ever run it. 668 01:06:52,523 --> 01:07:03,360 So that means that any kind of control flow operators that you want to have need to be not Python control flow operators; you need to use some kind of magic, special TensorFlow operations to do control flow. 669 01:07:03,360 --> 01:07:05,527 In this case this tf.cond. 670 01:07:06,713 --> 01:07:10,763 Another kind of similar situation happens if you want to have loops. 671 01:07:10,763 --> 01:07:19,839 So suppose that we want to compute some kind of recurrence relation, where maybe y_t is equal to y_(t-1) plus x_t times some weight matrix w, and 672 01:07:19,839 --> 01:07:26,436 every time we compute this, we might have a different sized sequence of data. 673 01:07:26,436 --> 01:07:33,371 And no matter the length of our sequence of data, we just want to compute this same recurrence relation, no matter the size of the input sequence. 674 01:07:33,371 --> 01:07:39,489 So in PyTorch this is super easy. We can just kind of use a normal for loop in Python 675 01:07:39,489 --> 01:07:47,095 to just loop over the number of times that we want to unroll, and now depending on the size of the input data, our computational graph will end up a different size, 676 01:07:47,095 --> 01:07:51,694 but that's fine, we can just backpropagate through each one, one at a time. 677 01:07:51,694 --> 01:07:55,782 Now in TensorFlow this becomes a little bit uglier. 678 01:07:55,782 --> 01:08:06,364 And again, because we need to construct the graph all at once up front, this control flow looping construct again needs to be an explicit node in the TensorFlow graph. 679 01:08:06,364 --> 01:08:13,517 So I hope you remember your functional programming, because you'll have to use those kinds of operators to implement looping constructs in TensorFlow. 680 01:08:13,517 --> 01:08:23,024 So in this case, for this particular recurrence relation you can use a foldl operation and, sort of, implement this particular loop in terms of a foldl. 681 01:08:24,100 --> 01:08:28,734 But what this basically means is that you have this sense that TensorFlow is almost building its own entire 682 01:08:28,734 --> 01:08:33,212 programming language, using the language of computational graphs. 683 01:08:33,212 --> 01:08:37,215 And any kind of control flow operator, or any kind of data structure, needs to be rolled 684 01:08:37,215 --> 01:08:44,216 into the computational graph, so you can't really utilize all your favorite paradigms for working imperatively in Python. 685 01:08:44,216 --> 01:08:52,804 You kind of need to relearn a whole separate set of control flow operators if you want to do any kind of control flow inside your computational graph in TensorFlow.
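For contrast, a sketch of that unrolled-loop case on the PyTorch side might look like this (the exact recurrence, sizes, and sequence length here are made up for illustration):

```python
import torch
from torch.autograd import Variable

T, D = 6, 4                      # the sequence length can differ from run to run
x = Variable(torch.randn(T, D))
w = Variable(torch.randn(D, D), requires_grad=True)

y = Variable(torch.zeros(1, D))
for t in range(T):
    # Each iteration adds more nodes to this forward pass's graph:
    # y_t = y_(t-1) + x_t times the weight matrix w.
    y = y + x[t].view(1, D).mm(w)

# The graph is as long as the input sequence, and backprop just works.
y.sum().backward()
```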
686 01:08:52,804 --> 01:08:58,238 So at least for me, I find that kind of confusing, a little bit hard to wrap my head around sometimes, 687 01:08:58,238 --> 01:09:06,722 and I kind of like that, using PyTorch dynamic graphs, you can just use your favorite imperative programming constructs and it all works just fine. 688 01:09:07,737 --> 01:09:21,579 By the way, there actually is some very new library called TensorFlow Fold, which is another one of these layers on top of TensorFlow that lets you implement dynamic graphs: you kind of write your own code 689 01:09:22,416 --> 01:09:32,277 using TensorFlow Fold that looks kind of like a dynamic graph operation, and then TensorFlow Fold does some magic for you and somehow implements that in terms of the static TensorFlow graphs. 690 01:09:32,277 --> 01:09:37,357 This is a super new paper that's being presented at ICLR this week in France. 691 01:09:37,358 --> 01:09:41,694 So I haven't had the chance to dive in and play with this yet. 692 01:09:41,694 --> 01:09:46,455 But my initial impression was that it does add some amount of dynamic graph support to TensorFlow, but it is still 693 01:09:46,455 --> 01:09:51,952 a bit more awkward to work with than the sort of native dynamic graphs you have in PyTorch. 694 01:09:51,952 --> 01:09:57,257 So then, I thought it might be nice to motivate why we would care about dynamic graphs in general. 695 01:09:57,257 --> 01:10:00,257 So one option is recurrent networks. 696 01:10:01,177 --> 01:10:07,612 So you can see that for something like image captioning we use a recurrent network which operates over sequences of different lengths. 697 01:10:07,612 --> 01:10:13,337 In this case, the sentence that we want to generate as a caption is a sequence, and that sequence can vary 698 01:10:13,337 --> 01:10:15,636 depending on our input data. 699 01:10:15,636 --> 01:10:21,694 So now you can see that we have this dynamism in the thing where, depending on the size of the sentence, 700 01:10:21,694 --> 01:10:25,716 our computational graph might need to have more or fewer elements. 701 01:10:25,716 --> 01:10:29,920 So that's one kind of common application of dynamic graphs. 702 01:10:29,920 --> 01:10:36,377 For those of you who took CS224N last quarter, you saw this idea of recursive networks, 703 01:10:36,377 --> 01:10:47,337 where sometimes in natural language processing you might, for example, compute a parse tree of a sentence and then you want to have a neural network kind of operate recursively up this parse tree. 704 01:10:47,337 --> 01:10:56,856 So we have a neural network that's not just a sequential stack of layers, but instead is kind of working over some graph or tree structure, where now each data point 705 01:10:56,856 --> 01:10:58,732 might have a different graph or tree structure, 706 01:10:58,732 --> 01:11:05,714 so the structure of the computational graph then kind of mirrors the structure of the input data. And it could vary from data point to data point. 707 01:11:05,714 --> 01:11:10,316 So this type of thing seems kind of complicated and hairy to implement using TensorFlow, 708 01:11:10,316 --> 01:11:14,887 but in PyTorch you can just kind of use normal Python control flow and it'll work out just fine. 709 01:11:16,574 --> 01:11:23,678 Another bit of a more research application is this really cool idea that I like called neural module networks for visual question answering.
710 01:11:23,678 --> 01:11:31,737 So here the idea is that we want to ask some questions about images, where we maybe input this image of cats and dogs, and there's some question, 711 01:11:31,737 --> 01:11:43,594 what color is the cat, and then internally the system can read the question, and it has these different specialized neural network modules for performing operations like asking for colors and finding cats. 712 01:11:43,594 --> 01:11:49,838 And then depending on the text of the question, it can compile this custom architecture for answering the question. 713 01:11:49,838 --> 01:11:55,094 And now if we asked a different question, like are there more cats than dogs? 714 01:11:55,094 --> 01:12:03,076 Now we have maybe the same basic set of modules for doing things like finding cats and dogs and counting, but they're arranged in a different order. 715 01:12:03,076 --> 01:12:07,716 So we get this dynamism again, where different data points might give rise to different computational graphs. 716 01:12:07,716 --> 01:12:12,574 But this is a bit more of a research thing and maybe not so mainstream right now. 717 01:12:12,574 --> 01:12:19,214 But as kind of a bigger point, I think that there's a lot of cool, creative applications that people could do with dynamic computational graphs, 718 01:12:19,214 --> 01:12:23,471 and maybe there aren't so many right now, just because it's been so painful to work with them. 719 01:12:23,471 --> 01:12:30,596 So I think that there's a lot of opportunity for doing cool, creative things with dynamic computational graphs. 720 01:12:30,596 --> 01:12:34,078 And maybe if you come up with cool ideas, we'll feature it in lecture next year. 721 01:12:34,078 --> 01:12:39,854 So I wanted to talk very briefly about Caffe, which is this framework from Berkeley. 722 01:12:39,854 --> 01:12:48,815 Caffe is somewhat different from the other deep learning frameworks in that, in many cases, you can actually train networks without writing any code yourself. 723 01:12:48,815 --> 01:12:53,214 You kind of just call into these pre-existing binaries, set up some configuration files, and in many cases 724 01:12:53,214 --> 01:12:56,697 you can train on data without writing any of your own code. 725 01:12:56,697 --> 01:13:03,054 So maybe first you convert your data into some format like HDF5 or LMDB, and there exist 726 01:13:03,054 --> 01:13:08,638 some scripts inside Caffe that can just convert folders of images and text files into these formats for you. 727 01:13:08,638 --> 01:13:19,934 Now, instead of writing code to define the structure of your computational graph, you edit some text file called a prototxt which sets up the structure of the computational graph. 728 01:13:19,934 --> 01:13:30,875 Here the structure is that we read from some input HDF5 file, we perform some inner product, we compute some loss, and the whole structure of the graph is set up in this text file. 729 01:13:30,875 --> 01:13:35,956 One kind of downside here is that these files can get really ugly for very large networks. 730 01:13:35,956 --> 01:13:44,253 So for something like the 152 layer ResNet model, which by the way was trained in Caffe originally, this prototxt file ends up almost 7000 lines long. 731 01:13:44,253 --> 01:13:51,817 So people are not writing these by hand. People will sometimes write Python scripts to generate these prototxt files.
732 01:13:51,817 --> 01:13:53,275 [laughter] 733 01:13:53,275 --> 01:13:58,974 Then you're kind of in the realm of rolling your own computational graph abstraction. That's probably not a good idea, but I've seen that before. 734 01:13:58,974 --> 01:14:07,497 Then, rather than having some optimizer object, there's a solver; you define the solver settings inside another prototxt. 735 01:14:07,497 --> 01:14:11,036 This defines your learning rate, your optimization algorithm and whatnot. 736 01:14:11,036 --> 01:14:17,278 And then once you do all these things, you can just run the Caffe binary with the train command and it all happens magically. 737 01:14:17,278 --> 01:14:21,294 Caffe has a model zoo with a bunch of pretrained models, that's pretty useful. 738 01:14:21,294 --> 01:14:25,438 Caffe has a Python interface, but it's not super well documented. 739 01:14:25,438 --> 01:14:31,455 You kind of need to read the source code of the Python interface to see what it can do, so that's kind of annoying. But it does work. 740 01:14:31,455 --> 01:14:40,174 So, kind of my general take on Caffe is that it's maybe good for feed forward models, and it's maybe good for production scenarios, 741 01:14:40,174 --> 01:14:42,796 because it doesn't depend on Python. 742 01:14:42,796 --> 01:14:47,358 But for research these days, I've seen Caffe being used maybe a little bit less. 743 01:14:47,358 --> 01:14:51,417 Although I think it is still pretty commonly used in industry, again for production. 744 01:14:51,417 --> 01:14:54,410 I promise, just one or two slides on Caffe 2. 745 01:14:54,410 --> 01:14:58,596 So Caffe 2 is the successor to Caffe, which is from Facebook. 746 01:14:58,596 --> 01:15:02,432 It's super new, it was only released a week ago. 747 01:15:02,432 --> 01:15:04,436 [laughter] 748 01:15:04,436 --> 01:15:09,314 So I really haven't had the time to form a super educated opinion about Caffe 2 yet, 749 01:15:09,314 --> 01:15:12,318 but it uses static graphs, kind of similar to TensorFlow. 750 01:15:12,318 --> 01:15:17,817 Kind of like Caffe 1, the core is written in C++ and they have some Python interface. 751 01:15:17,817 --> 01:15:21,518 The difference is that now you no longer need to write your own Python scripts to generate prototxt files. 752 01:15:21,518 --> 01:15:29,657 You can kind of define your computational graph structure all in Python, with an API that looks kind of like TensorFlow. 753 01:15:29,657 --> 01:15:34,596 But then you can spit out, you can serialize, this computational graph structure to a prototxt file. 754 01:15:34,596 --> 01:15:38,676 And then once your model is trained and whatnot, then we get this benefit that we talked about of static 755 01:15:38,676 --> 01:15:43,534 graphs, where you don't need the original training code now in order to deploy a trained model. 756 01:15:43,534 --> 01:15:49,417 So one interesting thing is that you've seen Google maybe has one major deep learning framework, 757 01:15:49,417 --> 01:15:53,761 which is TensorFlow, where Facebook has these two, PyTorch and Caffe 2. 758 01:15:54,596 --> 01:15:57,252 So these are kind of different philosophies. 759 01:15:57,252 --> 01:16:02,847 Google's kind of trying to build one framework to rule them all that maybe works for every possible scenario for deep learning. 760 01:16:02,847 --> 01:16:07,852 This is kind of nice because it consolidates all efforts onto one framework.
It means you only need to learn one thing 761 01:16:07,852 --> 01:16:13,772 and it'll work across many different scenarios, including distributed systems, production deployment, mobile, research, everything. 762 01:16:13,772 --> 01:16:15,706 You only need to learn one framework to do all these things. 763 01:16:15,706 --> 01:16:18,151 Whereas Facebook is taking a bit of a different approach. 764 01:16:18,151 --> 01:16:26,071 Where PyTorch is really more specialized, more geared towards research, so in terms of writing research code and quickly iterating on your ideas, 765 01:16:26,071 --> 01:16:32,951 that's super easy in PyTorch, but for things like running in production, running on mobile devices, PyTorch doesn't have a lot of great support. 766 01:16:32,951 --> 01:16:37,710 Instead, Caffe 2 is kind of geared toward those more production oriented use cases. 767 01:16:39,567 --> 01:16:47,350 So my kind of general, overall advice about which framework to use for which problems is this: 768 01:16:47,350 --> 01:16:53,510 I think TensorFlow is a pretty safe bet for just about any project that you want to start new, right? 769 01:16:53,510 --> 01:16:58,849 Because it is sort of one framework to rule them all, it can be used for just about any circumstance. 770 01:16:58,849 --> 01:17:05,207 However, you probably need to pair it with a higher level wrapper, and if you want dynamic graphs, you're maybe out of luck. 771 01:17:05,207 --> 01:17:13,190 Some of the code ends up looking a little bit uglier in my opinion, but maybe that's kind of a cosmetic detail and it doesn't really matter that much. 772 01:17:13,190 --> 01:17:15,809 I personally think PyTorch is really great for research. 773 01:17:15,809 --> 01:17:21,233 If you're focused on just writing research code, I think PyTorch is a great choice. 774 01:17:21,233 --> 01:17:25,649 But it's a bit newer, has less community support, less code out there, so it could be a bit of an adventure. 775 01:17:25,649 --> 01:17:29,969 If you want more of a well trodden path, TensorFlow might be a better choice. 776 01:17:29,969 --> 01:17:34,710 If you're interested in production deployment, you should probably look at Caffe, Caffe 2 or TensorFlow. 777 01:17:34,710 --> 01:17:41,270 And if you're really focused on mobile deployment, I think TensorFlow and Caffe 2 both have some built in support for that. 778 01:17:41,270 --> 01:17:47,393 So unfortunately there's not just one global best framework; it kind of depends on what you're actually trying to do, 779 01:17:47,393 --> 01:17:52,045 and what applications you anticipate, but this is kind of my general advice on those things. 780 01:17:53,169 --> 01:17:55,691 So next time we'll talk about some case studies of various CNN architectures.